I am aware of the textbook method where we multiply the mantissa by 2, take its integer part as the next bit, multiply the fractional part by 2, and repeat until we get zero or reach the desired precision.
Is there a more efficient algorithm for converting a mantissa from base 10 to base 2 than the one described above?
The algorithm you proposed runs in O(n) time, where n is the number of bits desired. We cannot do better than this, because any algorithm has to produce all n desired bits of the output, so it must take at least Ω(n) time; otherwise the output could not possibly contain all the desired information.
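For reference, here is a minimal C sketch of that textbook method (frac_to_binary is just an illustrative name), assuming the decimal fraction is already held in a double:

```c
#include <stdio.h>

/* Textbook conversion of a fraction in [0, 1) to binary digits:
   repeatedly multiply by 2 and take the integer part as the next bit. */
static void frac_to_binary(double frac, int nbits, char *out)
{
    int i;
    for (i = 0; i < nbits && frac > 0.0; i++) {
        frac *= 2.0;
        if (frac >= 1.0) {
            out[i] = '1';
            frac -= 1.0;
        } else {
            out[i] = '0';
        }
    }
    out[i] = '\0';
}

int main(void)
{
    char bits[64];
    frac_to_binary(0.4512, 23, bits);   /* 0.4512 -> 0.0111001110... */
    printf("0.%s\n", bits);
    return 0;
}
```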
To get a binary representation of a natural number like 20, we repeatedly divide the number by 2 until we cannot divide by 2 anymore. To get a binary representation of a decimal fraction like 0.4512, we repeatedly multiply the number by 2.
What is the logical explanation for why these two procedures yield a binary representation?
Thanks
It is based on the fact that numbers are coded in binary.
If the number A is an integer, A can be written as
A = Σ_{i=0}^{n-1} a_i × 2^i = a_{n-1} × 2^(n-1) + a_{n-2} × 2^(n-2) + ... + a_1 × 2 + a_0
where each a_i is 0 or 1.
It is easy to see that if A is even, a_0 = 0, and if it is odd, a_0 = 1. So we already have the least significant bit a_0.
Now, if we divide A by two (integer division), a_0 disappears and we have
A/2 = a_{n-1} × 2^(n-2) + a_{n-2} × 2^(n-3) + ... + a_2 × 2 + a_1
In this way we can determine a_1 from the parity of A/2, and by continuing we obtain all the bits of A.
Fractional numbers are expressed in terms of negative powers of 2. If A = 0.a_{-1}a_{-2}...a_{-n}, then
A = a_{-1}/2 + a_{-2}/4 + ... + a_{-n}/2^n
If we multiply it by two, 2×A = a_{-1} + a_{-2}/2 + ... + a_{-n}/2^(n-1). If 2×A ≥ 1, we must have a_{-1} = 1, otherwise a_{-1} = 0. We can determine the other bits in a similar way by successive multiplications by two.
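To make the integer half of this concrete, here is a small C sketch (int_to_binary is only an illustrative name) that peels off the bits by repeated division by two; the fractional half works the same way with repeated multiplication by two:

```c
#include <stdio.h>

/* Extract the bits of a non-negative integer, least significant bit first,
   by repeated division by 2: the parity of the current value is the next bit. */
static void int_to_binary(unsigned int a, char *out)
{
    int i = 0;
    do {
        out[i++] = (a % 2) ? '1' : '0';   /* a_0 is the parity of A */
        a /= 2;                           /* dividing by 2 drops a_0 */
    } while (a > 0);
    out[i] = '\0';                        /* digits are stored LSB first */
}

int main(void)
{
    char bits[33];
    int_to_binary(20, bits);   /* 20 -> "00101" (LSB first), i.e. 10100 */
    printf("%s\n", bits);
    return 0;
}
```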
I'm trying to understand how floating-point arithmetic plays a role in computer science when using the binary system. I came across an excerpt from What Every Computer Scientist Should Know About Floating-Point Arithmetic which defines normalized numbers as floating-point numbers whose leading significand digit is non-zero, which makes the representation unique. It goes on to say...
When β = 2, p = 3, e_min = -1 and e_max = 2 there are 16 normalized floating-point numbers, as shown in Figure D-1.
Where β is the base, p is the precision, e_min is the minimum exponent, and e_max is the maximum exponent.
My attempt at understanding how he came to the conclusion that there are 16 normalized floating-point numbers was to multiply the number of possible significands, β^p, by the number of possible exponents, e_max - e_min + 1. My result was 32 possible normalized floating-point values. I am unsure how to get the correct result of 16 normalized floating-point values declared in the paper above. I assumed negative floating-point values were excluded, so I did not include them in my calculations.
This question is more geared toward mathematical formulae. But it will help me to better understand how floating-point arithmetic works in computer science.
I would like to know how to get the correct result of 16 normalized floating-point numbers and why.
Since the first bit is always 1, with 3 bits for the mantissa you have only two bits to vary, yielding 4 different mantissa values. Combined with 4 different exponent values that's 16. I haven't looked at the paper though.
My attempt at understanding how he came to the conclusion of there being 16 normalized floating-point numbers was to multiply together the possible number of significands β^p and the possible number of exponents e_max - e_min + 1
This is correct, except that the number of possible significands is not β^p. In binary with an implicit leading 1, the number of possible significands is β^(p-1), encoded over p-1 bits.
In other words, the missing values for the possible significands have already been taken advantage of when the encoding reserved, say, 52 bits to encode a precision of 53 binary digits.
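If it helps to see the count directly, here is a small C sketch that enumerates the normalized values for β = 2, p = 3, e_min = -1, e_max = 2, counting positive values only, as in Figure D-1:

```c
#include <stdio.h>

int main(void)
{
    /* beta = 2, p = 3: the normalized significands are 1.00, 1.01, 1.10, 1.11,
       i.e. 4/4 .. 7/4 in steps of 1/4 (the leading bit is fixed at 1). */
    int count = 0;
    for (int e = -1; e <= 2; e++) {                  /* e_min .. e_max */
        for (int m = 4; m <= 7; m++) {               /* 2^(p-1) = 4 significands */
            double value = (m / 4.0) * (e >= 0 ? (double)(1 << e) : 1.0 / (1 << -e));
            printf("%g\n", value);
            count++;
        }
    }
    printf("count = %d\n", count);                   /* prints 16 */
    return 0;
}
```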
If I wanted to represent -2455.1152 in 32 bits, I know the first bit is 1 (negative sign), and I can get 2455 to binary as 10010010111, but I'm not too sure about the fractional part. .1152 could require an infinite number of fractional bits. Would that mean that only up to 23 bits are used to represent the fractional part? So since 2455 uses 11 bits, bits 11 to 0 are for the fractional part?
For the binary representation I have 10010010111.00011101001. The exponent is 10. 10+127=137. 137 as binary is 10001001.
The full representation would be:
1 10001001 1001001011100011101001
Is that right?
It looks like you are trying to devise your own floating-point representation, but you used a fixed-point tag so I will explain how to convert your real number to a traditional fixed-point representation. First, you need to decide how many bits will be used to represent the fractional part of the number. Just for the sake of discussion let's say that 16 bits will be used for the fractional part, 15 bits for the integer part, and one bit reserved for the sign bit. Now, multiply the absolute value of the real number by 2^{16}: 2455.1152 * 65536 = 160898429.747. You can either round to the nearest integer or just truncate. Suppose we just truncate to 160898429. Converting this to hexadecimal we get 0x09971D7D. To make this negative, invert and add a 1 to the LSB, and the final result is 0xF668E283.
To convert back to a real number just reverse the process. Take the absolute value of the fixed-point representation and divide by 2^{16}. In this case we would find that the fixed-point representation is equal to the real number -2455.1151886 . The accuracy can be improved by rounding instead of truncating when converting from real to fixed-point, or by allowing more bits for the fractional part.
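Here is a rough C sketch of the same conversion, assuming the layout described above (1 sign bit, 15 integer bits, 16 fractional bits, i.e. a Q15.16 format); to_q16 and from_q16 are only illustrative names:

```c
#include <stdint.h>
#include <stdio.h>

/* Q15.16 fixed point: value = raw / 2^16, stored in a signed 32-bit integer. */
static int32_t to_q16(double x)
{
    return (int32_t)(x * 65536.0);   /* truncates toward zero; round instead by
                                        adding +0.5 or -0.5 before the cast */
}

static double from_q16(int32_t raw)
{
    return raw / 65536.0;
}

int main(void)
{
    int32_t raw = to_q16(-2455.1152);
    printf("raw  = 0x%08X\n", (uint32_t)raw);   /* two's complement pattern 0xF668E283 */
    printf("back = %.7f\n", from_q16(raw));     /* roughly -2455.1151886 */
    return 0;
}
```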
I have faced an interview question related to embedded systems and C/C++. The question is:
If we multiply 2 signed (2's complement) 16-bit data, what should be the size of resultant data?
I've started by attempting an example with two signed 4-bit values: if we multiply +7 and -7, we end up with -49, which requires 7 bits. But I could not formulate a general relation.
I think I need to understand binary multiply deeply to solve this question.
First, an n-bit signed integer contains a value in the range -(2^(n-1)) .. +(2^(n-1))-1.
For example, for n=4, the range is -(2^3)..(2^3)-1 = -8..+7
The range of the multiplication result is -8*+7 .. -8*-8 = -56..+64.
+64 is more than 2^6 - 1; it is 2^6 = 2^(2n-2)! You'll need 2n-1 bits to store such a POSITIVE integer.
Unless you're doing a proprietary encoding (see the next paragraph), you'll need 2n bits:
one bit for the sign, and 2n-1 bits for the absolute value of the multiplication result.
If M is the result of the multiplication, you can store -M or M-1 instead; this can save you 1 bit.
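As a quick sanity check of the 2n-bit claim, here is a small C sketch with the extreme 16-bit operands:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int16_t a = INT16_MIN;    /* -32768 */
    int16_t b = INT16_MIN;    /* -32768 */

    /* Widen before multiplying so the product cannot overflow:
       -32768 * -32768 = 1073741824 = 2^30, which needs 31 magnitude bits
       plus a sign bit, i.e. 2n = 32 bits in total for n = 16. */
    int32_t product = (int32_t)a * (int32_t)b;

    printf("%ld\n", (long)product);   /* prints 1073741824 */
    return 0;
}
```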
This will depend on context. In C/C++, all operands smaller than int are promoted to int. So if int is larger than 16 bits, the result will be a signed int (typically 32 bits).
However, if you assign it back to a 16-bit integer, it will be truncated, leaving only the bottom 16 bits of the two's complement of the new number.
So if your definition of "result" is the intermediate immediately following the multiply, then the answer is the size of int. If you define the size as after you've stored it back to a 16-bit variable, then answer is the size of the 16-bit integer type.
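To illustrate both cases, here is a short C sketch, assuming a platform where int is 32 bits:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int16_t a = 30000;
    int16_t b = 30000;

    int32_t wide   = a * b;   /* both operands are promoted to int (32 bits here),
                                 so the intermediate is 900000000 */
    int16_t narrow = a * b;   /* assigning back to 16 bits keeps only the low 16 bits
                                 of 900000000; the out-of-range conversion is
                                 implementation-defined, typically -5888 */

    printf("%ld %d\n", (long)wide, narrow);
    return 0;
}
```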
In our project, we take two floats from the user, store them in integer registers, and treat them as IEEE 754 single-precision floats, manipulating the bits by masking. So after I multiply the 23 fraction bits, should I take into account the result placed in the LO register if I want to return a single-precision float (32 bits) as the product?
First off, I hope you mean 24 bits of value, since you'll need to include the implicit mantissa bit in your multiplication.
Second, if you want your multiplication to be correctly rounded, as in IEEE 754, you will (sometimes) need the low part of the multiply in order to deliver the correctly rounded result.
On the other hand, if you don't need to implement correct rounding, and you left-shift your fraction bits before multiplication, you will be able to ignore the low word of the result.
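Here is a rough C sketch of the significand multiplication, assuming 24-bit significands with the implicit bit already inserted; on MIPS the full product would be split across HI and LO, but here it is simply held in a uint64_t:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Two 24-bit significands with the implicit leading 1 set
       (both represent 1.5 here: 0xC00000 / 2^23). */
    uint32_t sig_a = 0xC00000;
    uint32_t sig_b = 0xC00000;

    /* 24 x 24 -> 48-bit product. */
    uint64_t product = (uint64_t)sig_a * sig_b;

    /* The top bits form the candidate result significand (it may still need a
       one-bit normalization shift); the bits below are exactly what correct
       rounding needs to look at before they are discarded. */
    uint32_t high = (uint32_t)(product >> 24);
    uint32_t low  = (uint32_t)(product & 0xFFFFFF);

    printf("product = 0x%012llX\n", (unsigned long long)product);
    printf("high = 0x%06X, low = 0x%06X\n", (unsigned)high, (unsigned)low);
    return 0;
}
```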