I am writing a generic routine for converting fixed-point numbers between decimal and binary representations.
For positive numbers the processing is simple, but for negative numbers I have found conflicting sources. Some say a single bit holds the sign, while others say the whole number should be stored as a pseudo-integer in 2's complement, even when it is negative.
Please anyone tell me which source is correct or is there a standard representation for signed fixed point numbers?
Additionally, if the 2's complement representation was correct then how to represent negative numbers with zero integer part. For example -0.125?
Fixed-point numbers are just binary values where the place values have been changed. Assigning place values to the bits is an arbitrary human activity, and we can do it in any way that makes sense. Normally we talk about binary integers so it is convenient to assign the place value 2^0 = 1 to the LSB, 2^1=2 to the bit to the left of the LSB, and so on. For an N bit integer the place value of the MSB becomes 2^(N-1). If we want a two's-complement representation, we change the place value of the MSB to -2^(N-1) and all of the other bit place values are unchanged.
For fixed-point values, if we want F bits to represent a fractional part of the number, then the place value of the LSB becomes 2^(0-F) and the place value of the MSB becomes 2^(N-1-F) for unsigned numbers and -2^(N-1-F) for signed numbers.
So, how would we represent -0.125 in a two's-complement fixed-point value? That is equal to 0.875 - 1, so we can use a representation where the place value of the MSB is -1 and the value of all of the other bits adds up to 0.875. If you choose a 4-bit fixed-point number with 3 fraction bits you would say that 1111 binary equals -0.125 decimal. Adding up the place values of the bits we have (-1) + 0.5 + 0.25 + 0.125 = -0.125. My personal preference is to write the binary number as 1.111 to note which bits are fraction and which are integer.
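The place-value rule above can be checked with a minimal sketch. The function name `fixed_value` and its parameters are hypothetical, chosen only for illustration; it assumes the bit string is written MSB first.

```python
def fixed_value(bits: str, frac_bits: int, signed: bool = True) -> float:
    """Sum the place values of a bit string that has `frac_bits` fraction bits."""
    n = len(bits)
    total = 0.0
    for i, b in enumerate(bits):
        # Bit i (counting from the MSB) has place value 2^(N-1-i-F).
        weight = 2.0 ** (n - 1 - i - frac_bits)
        if signed and i == 0:
            weight = -weight  # two's complement: only the MSB weight is negated
        total += weight * int(b)
    return total

print(fixed_value("1111", 3))          # -0.125
print(fixed_value("1111", 3, False))   # 1.875 when read as unsigned
```

The same bit pattern gives two different values depending only on the place value assigned to the MSB.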
The reason we use this approach is that the normal integer arithmetic operators still work.
It's easiest to think of fixed-point numbers as scaled integers rather than shifted integers. For a given fixed-point type, there is a fixed scale, which is a power of two (or ten). To convert from the real value to the integer representation, multiply by that scale. To convert back, divide by it. How negative values are represented then becomes a detail of the integer type used to hold the number.
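A rough sketch of the scaled-integer view, assuming 3 fraction bits and a 4-bit word as in the earlier example. The names `to_fixed`, `from_fixed`, `FRAC_BITS` and `SCALE` are made up for this illustration, and `round` is just one possible rounding choice.

```python
FRAC_BITS = 3            # assumed number of fraction bits
SCALE = 1 << FRAC_BITS   # scale factor 2^3 = 8

def to_fixed(x: float) -> int:
    """Real value -> scaled integer (Python's round() rounds ties to even)."""
    return round(x * SCALE)

def from_fixed(raw: int) -> float:
    """Scaled integer -> real value."""
    return raw / SCALE

a = to_fixed(0.5)        # 4
b = to_fixed(-0.125)     # -1
total = a + b            # plain integer addition works on the scaled values
print(from_fixed(total))          # 0.375
print(format(b & 0xF, "04b"))     # 1111, the 4-bit two's complement pattern
```

Masking with `0xF` only exposes the bit pattern; the arithmetic itself never needs to know how the sign is encoded.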
Please anyone tell me which source is correct...
Both are problematic.
Your first source is incorrect. The given example is not...
the same as 2's complement numbers.
In two’s complement, the MSB's (most significant bit's) weight is negated but the other bits still contribute positive values. Thus a two’s complement number with all bits set to 1 does not produce the minimum value.
Your second source could be a little misleading where it says...
shifting the bit pattern of a number to the right by 1 bit always divide the number by 2.
This statement brushes over the underflow that occurs when the LSB (least significant bit) is set to 1, and the resulting rounding. Right-shifting commonly rounds towards negative infinity, while integer division rounds towards zero (truncation). Both give the same result for positive numbers: 3/2 == 1 and 3>>1 == 1. For negative numbers they differ: -3/2 == -1 but -3>>1 == -2.
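The difference can be reproduced in a small sketch. Python's `>>` on integers is an arithmetic shift that rounds towards negative infinity; the helper `trunc_div` is a hypothetical stand-in for C-style truncating division, since Python's own `//` operator floors instead.

```python
import math

def trunc_div(a: int, b: int) -> int:
    """Division rounding towards zero, as C-style integer division does."""
    return math.trunc(a / b)

for value in (3, -3):
    shifted = value >> 1            # rounds towards negative infinity
    divided = trunc_div(value, 2)   # rounds towards zero
    print(value, shifted, divided)
# 3 1 1
# -3 -2 -1
```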
...is there a standard representation for signed fixed point numbers?
I don't think so. There are language-specific standards, e.g. ISO/IEC TR 18037 (draft). But the convention of scaling integers to approximate real numbers of predetermined range and resolution is well established. How the underlying integers are represented is another matter.
Additionally, if the 2's complement representation was correct then how to represent negative numbers with zero integer part. For example -0.125?
That depends on the format of your integer and your choice of radix. Assuming a 16-bit two’s complement number representing binary fixed-point values, the scaling factor is 2^15 which is 32,768. Multiply the value to store as an integer: -0.125*32768. == -4096 and divide to retrieve it: -4096/32768. == -0.125.
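As an illustration of that calculation, here is a minimal sketch assuming the 16-bit format with a scale of 2^15 described above. The helper names `encode_q15` and `decode_q15` are hypothetical.

```python
SCALE = 1 << 15  # 32768, the assumed scaling factor

def encode_q15(x: float) -> int:
    """Real value -> 16-bit two's complement pattern (no range checking)."""
    return round(x * SCALE) & 0xFFFF

def decode_q15(word: int) -> float:
    """16-bit pattern -> real value."""
    raw = word - 0x10000 if word & 0x8000 else word  # undo two's complement
    return raw / SCALE

word = encode_q15(-0.125)
print(hex(word))          # 0xf000, i.e. -4096 as a signed 16-bit integer
print(decode_q15(word))   # -0.125
```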
I'm currently learning about IEEE754 standard and rounding, and I have an exercise which is the following:
Add -325.875 to 0.546875 in IEEE754, but with 3 bits dedicated to the mantissa instead of 23.
I'm having a lot of trouble doing this, especially representing the intermediate values and the guard/round/sticky bits. Can someone give me a step-by-step solution to the problem?
My biggest problem is that obviously I can't represent 0.546875 as 0.100011 as that would have more precision than the system has. So how would that be represented?
Apologies if the wording is confusing.
Preliminaries
The preferred term for the fraction portion of a floating-point number is “significand,” not “mantissa.” “Mantissa” is an old word for the fraction portion of a logarithm. Mantissas are logarithmic; adding to the mantissa multiplies the number represented. Significands are linear; adding to the significand adds to the number represented (as scaled by the exponent).
When working with a significand, use its mathematical precision, not the number of bits in the storage format. The IEEE-754 binary32 format has 23 bits in its primary field for the encoding of a significand, but another bit is encoded via the exponent field. Mathematically, numbers in the binary32 format behave as if they have 24 bits in their significands.
So, the task is to work with numbers with four bits in their significands, not three.
Work
In binary, −325.875 is −101000101.111₂. In scientific notation, that is −1.01000101111₂•2^8. Rounding it to four bits in the significand gives −1.010₂•2^8.
In binary, 0.546875 is .100011₂. In scientific notation, that is 1.00011₂•2^−1. Rounding it to four bits in the significand gives 1.001₂•2^−1. Note that the first four bits are 1000, but they are immediately followed by 11, so we round up. 1.00011 is closer to 1.001 than it is to 1.000.
So, in a floating-point format with four-bit significands, we want to add −1.010₂•2^8 and 1.001₂•2^−1. If we adjust the latter number to have the same exponent as the former, we have −1.010₂•2^8 and 0.000000001001₂•2^8. To add those, we note the signs are different, so we want to subtract the magnitudes. It may help to line up the digits as we were taught in elementary school:
1.010000000000
0.000000001001
--------------
1.001111110111
Thus, the mathematical result would be −1.001111110111₂•2^8. However, we need to round the significand to four bits. The first four bits are 1001, but they are followed by 11, so we round up, producing 1010. So the final result is −1.010₂•2^8.
−1.010₂•2^8 is −1.25•2^8 = −320.
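The three rounding steps above can be replayed with exact rational arithmetic. This is only a sketch: `round_sig` is a hypothetical helper, it assumes round-to-nearest with ties to even, and it ignores exponent range limits entirely.

```python
from fractions import Fraction

def round_sig(x: Fraction, sig_bits: int) -> Fraction:
    """Round x to `sig_bits` significant bits, ties to even."""
    if x == 0:
        return x
    sign = -1 if x < 0 else 1
    mag = abs(x)
    # Find e with 2^e <= mag < 2^(e+1).
    e = mag.numerator.bit_length() - mag.denominator.bit_length()
    if Fraction(2) ** e > mag:
        e -= 1
    ulp = Fraction(2) ** (e - (sig_bits - 1))  # weight of the last kept bit
    return sign * round(mag / ulp) * ulp       # round() on a Fraction ties to even

a = round_sig(Fraction("-325.875"), 4)   # -320
b = round_sig(Fraction("0.546875"), 4)   # 9/16 = 0.5625
result = round_sig(a + b, 4)             # exact sum, then one final rounding
print(a, b, result)                      # -320 9/16 -320
```

Computing the exact sum and rounding once matches what guard, round and sticky bits are meant to achieve in hardware.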
The MIPS multiply hardware stores the 64-bit product in two registers, HI and LO. What does it mean when the value in the HI register is minus one (-1)?
The signed value -1 corresponds to the binary value 1111...111 (all ones) in two's complement.
You should get familiar with two's complement.
In such representation:
Positive integers are represented with a leading 0 (in the most significant bit) followed by the binary representation of the integer.
Negative integers are represented by taking the ones' complement (inverting every bit of the binary representation of the absolute value) and then adding one. In this case the most significant bit becomes 1.
By inverting the binary representation of the absolute value of an integer, the leading 0s of the binary representation become 1s.
Therefore, if HI has the signed value -1, all of its bits are 1, so the 64-bit product is negative (its most significant bit is 1). If bit 31 of LO is also 1, HI is nothing but sign extension and the whole signed product fits in LO.
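To see this concretely, here is a small sketch that models the 64-bit product split into two 32-bit halves. The names `mult_hi_lo` and `as_signed32` are hypothetical, and the code only imitates the register contents, not the MIPS instruction itself.

```python
def mult_hi_lo(a: int, b: int):
    """Split the 64-bit two's complement product of a and b into (HI, LO)."""
    product = (a * b) & 0xFFFFFFFFFFFFFFFF
    return product >> 32, product & 0xFFFFFFFF

def as_signed32(word: int) -> int:
    """Interpret a 32-bit pattern as a signed value."""
    return word - (1 << 32) if word & 0x80000000 else word

hi, lo = mult_hi_lo(3, -5)
print(as_signed32(hi), as_signed32(lo))    # -1 -15: LO alone holds the product

hi2, lo2 = mult_hi_lo(65536, -40000)
print(as_signed32(hi2), as_signed32(lo2))  # -1 1673527296: LO alone is not enough
```

In the second case HI is still -1, but bit 31 of LO is 0, so the product (-2621440000) does not fit in 32 bits.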
I'm reading this article about exponent bias in floating point numbers and it says the following:
In IEEE 754 floating point numbers, the exponent is biased in the
engineering sense of the word – the value stored is offset from the
actual value by the exponent bias. Biasing is done because exponents
have to be signed values in order to be able to represent both tiny
and huge values, but two's complement, the usual representation for
signed values, would make comparison harder. To solve this problem the
exponent is biased before being stored, by adjusting its value to put
it within an unsigned range suitable for comparison. By arranging the
fields so that the sign bit is in the most significant bit position,
the biased exponent in the middle, then the mantissa in the least
significant bits, the resulting value will be ordered properly,
whether it's interpreted as a floating point or integer value. This
allows high speed comparisons of floating point numbers using fixed
point hardware.
I've also found this explanation from wikipedia's article about offset binary:
This has the consequence that the "zero" value is represented by a 1
in the most significant bit and zero in all other bits, and in general
the effect is conveniently the same as using two's complement except
that the most significant bit is inverted. It also has the consequence
that in a logical comparison operation, one gets the same result as
with a two's complement numerical comparison operation, whereas, in
two's complement notation a logical comparison will agree with two's
complement numerical comparison operation if and only if the numbers
being compared have the same sign. Otherwise the sense of the
comparison will be inverted, with all negative values being taken as
being larger than all positive values.
I don't really understand what kind of comparison they are talking about here. Can someone please explain using a simple example?
'Comparison' here refers to the usual comparison of numbers by size: 5 > 4, etc. Suppose floating-point numbers were stored as
[sign bit] [unbiased exponent] [mantissa]
For example, if the exponent is a 2's complement 3-bit binary number and the mantissa is a 4-bit unsigned binary number, you'd have
1 010 1001 = 4.5
1 110 0111 = 0.21875
You can see that the first is bigger than the second, but to figure this out, the computer would have to calculate 1.001 x 2^2 and 0.111 x 2^(-2) and then compare the resulting floating-point numbers. This is already complex with floating-point hardware, and if there is no such hardware for this computer, then...
So the number is stored as
[sign bit] [biased exponent] [mantissa]
Using the same 3-bit binary number for the exponent (this time biased; see a related question) and unsigned 4-bit mantissa, we have
1 101 1001 = 4.5
1 001 0111 = 0.21875
But now comparison is very easy! You can treat the two numbers as integers 11011001 and 10010111 and see that the first is obviously bigger: obvious even to a computer, as integer comparisons are easy. (Both values here share the same leading sign bit, so it does not affect the result.) This is why biased exponents are used.
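The two layouts can be compared directly in a short sketch. It follows the toy 8-bit format of this answer (1 sign bit, 3 exponent bits, 4 mantissa bits, leading bit 1 in both examples) and assumes a bias of 3; `pack` is a hypothetical helper.

```python
BIAS = 3  # assumed bias for the 3-bit exponent field

def pack(sign: int, exp_field: int, mantissa: int) -> int:
    """Pack [sign][3-bit exponent field][4-bit mantissa] into one integer."""
    return (sign << 7) | ((exp_field & 0b111) << 4) | mantissa

# 4.5 has exponent 2, 0.21875 has exponent -2.
big_biased = pack(1, 2 + BIAS, 0b1001)      # 0b11011001
small_biased = pack(1, -2 + BIAS, 0b0111)   # 0b10010111
big_twos = pack(1, 2, 0b1001)               # 0b10101001
small_twos = pack(1, -2, 0b0111)            # 0b11100111

print(big_biased > small_biased)  # True: integer order matches numeric order
print(big_twos > small_twos)      # False: the negative exponent looks larger
```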
If I wanted to represent -2455.1152 in 32 bits, I know the first bit is 1 (negative sign), and I can convert 2455 to binary as 10010010111, but I'm not too sure about the fractional part. .1152 could have an infinite number of fractional bits. Would that mean that only up to 23 bits are used to represent the fractional part? So since 2455 uses 11 bits, bits 11 to 0 are for the fractional part?
For the binary representation I have 10010010111.00011101001. The exponent is 10. 10+127=137. 137 in binary is 10001001.
The full representation would be:
1 10001001 1001001011100011101001
Is that right?
It looks like you are trying to devise your own floating-point representation, but you used a fixed-point tag so I will explain how to convert your real number to a traditional fixed-point representation. First, you need to decide how many bits will be used to represent the fractional part of the number. Just for the sake of discussion let's say that 16 bits will be used for the fractional part, 15 bits for the integer part, and one bit reserved for the sign. Now, multiply the absolute value of the real number by 2^16: 2455.1152 * 65536 = 160898429.747. You can either round to the nearest integer or just truncate. Suppose we just truncate to 160898429. Converting this to hexadecimal we get 0x09971D7D. To make this negative, invert all the bits and add 1 (two's complement negation), and the final result is 0xF668E283.
To convert back to a real number just reverse the process. Take the absolute value of the fixed-point representation, divide by 2^16, and restore the sign. In this case we would find that the fixed-point representation is equal to the real number -2455.1151886. The accuracy can be improved by rounding instead of truncating when converting from real to fixed-point, or by allowing more bits for the fractional part.
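The conversion in both directions might look like the following sketch. It assumes the 32-bit layout with 16 fraction bits chosen above; the names `to_fixed32` and `from_fixed32` are hypothetical, and no range checking is done.

```python
FRAC_BITS = 16  # assumed number of fraction bits

def to_fixed32(x: float, truncate: bool = True) -> int:
    """Real value -> 32-bit two's complement fixed-point pattern."""
    scaled = abs(x) * (1 << FRAC_BITS)
    magnitude = int(scaled) if truncate else round(scaled)
    raw = -magnitude if x < 0 else magnitude
    return raw & 0xFFFFFFFF

def from_fixed32(word: int) -> float:
    """32-bit pattern -> real value."""
    raw = word - (1 << 32) if word & 0x80000000 else word
    return raw / (1 << FRAC_BITS)

word = to_fixed32(-2455.1152)
print(hex(word))            # 0xf668e283
print(from_fixed32(word))   # approximately -2455.1151886
```

Passing `truncate=False` rounds to the nearest integer instead, which gives 0xF668E282 for this input.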
I have faced an interview question related to embedded systems and C/C++. The question is:
If we multiply two signed (2's complement) 16-bit values, what should be the size of the result?
I started by multiplying two signed 4-bit values as an example: if we multiply +7 and -7, we end up with -49, which requires 7 bits. But I could not formulate a general relation.
I think I need to understand binary multiplication more deeply to solve this question.
First, an n-bit signed integer contains a value in the range -(2^(n-1))..+(2^(n-1))-1.
For example, for n=4, the range is -(2^3)..(2^3)-1 = -8..+7
The range of the multiplication result is -8*+7 .. -8*-8 = -56..+64.
+64 is more than 2^6-1: it is 2^6 = 2^(2n-2). You'll need 2n-1 bits to store such a POSITIVE integer.
Unless you're doing proprietary encoding (see next paragraph), you'll need 2n bits:
One bit for the sign, and 2n-1 bits for the absolute value of the multiplication result.
If M is the result of the multiplication, you can store -M or M-1 instead. This can save you 1 bit.
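The 2n-bit claim can be checked numerically with a small sketch. The helper names `signed_bits_needed` and `product_width` are hypothetical; the code only tests the extreme products, which bound every other product.

```python
def signed_bits_needed(value: int) -> int:
    """Smallest two's complement width that can hold `value`."""
    width = 1
    while not (-(1 << (width - 1)) <= value <= (1 << (width - 1)) - 1):
        width += 1
    return width

def product_width(n: int) -> int:
    """Width needed for any product of two n-bit signed integers."""
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    extremes = (lo * lo, lo * hi, hi * hi)
    return max(signed_bits_needed(p) for p in extremes)

print(product_width(4))           # 8
print(product_width(16))          # 32
print(signed_bits_needed(-49))    # 7, the +7 * -7 case from the question
print(signed_bits_needed(-8 * 7)) # 7
```

Only the single product of the two most negative inputs needs the full 2n bits; every other product fits in 2n-1.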
This depends on context. In C/C++, operands narrower than int are promoted to int before the multiplication, so the result has type int. If int is wider than 16 bits (commonly 32 bits), the result is a signed integer of that width.
However, if you assign it back to a 16-bit integer, typical implementations truncate it, keeping only the bottom 16 bits of the two's complement product.
So if your definition of "result" is the intermediate immediately following the multiply, then the answer is the size of int. If you define the size as after you've stored it back to a 16-bit variable, then the answer is the size of the 16-bit integer type.
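The effect of storing the product back into 16 bits can be modelled with a short sketch. This imitates the truncating behaviour of typical two's complement implementations and is not a statement about what every C/C++ compiler must do; `wrap_int16` is a hypothetical helper.

```python
def wrap_int16(value: int) -> int:
    """Keep the bottom 16 bits and reinterpret them as a signed value."""
    low = value & 0xFFFF
    return low - 0x10000 if low & 0x8000 else low

print(wrap_int16(300 * 300))   # 24464, the low 16 bits of 90000
print(wrap_int16(200 * 200))   # -25536, because bit 15 of 40000 is set
print(wrap_int16(100 * 100))   # 10000, small products are unchanged
```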