Encoding Numbers into IEEE754 half precision - binary

I have a quick question about a problem I'm trying to solve. For this problem, I have to convert (0.0A)16 into the IEEE 754 half-precision floating-point format. I converted it to binary (0000.0000 1010), normalized it (1.010 * 2^-5), and encoded the exponent (which came out to be 01010, i.e. -5 + 15 = 10 with the bias of 15), but now I'm lost on how to put it into the actual form. What should I do with the fractional part? The answer comes out to be 0 01010 0100000000.
I know there's something to do with omitting a leading 1, but I'm not entirely sure where that happens either.
Any help is appreciated!

The 1 you have to omit is the leading 1 of the mantissa: since a normalized significand always starts with 1, IEEE 754 leaves it implicit and gains one bit of precision. Your mantissa is 1.010, so you store only "010", padded with zeros to fill the 10-bit field.
The solution 0 01010 0100000000 means:
0 is the sign;
01010 is the exponent;
0100000000 is the 10-bit mantissa, with the leading 1 omitted.
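If you want to double-check a result like this, here is a minimal Python sketch (the variable names are my own) that packs the value with the struct module's half-precision 'e' format and prints the three fields:

    import struct

    x = 0x0A / 16**2   # (0.0A)_16 = 10/256 = 0.0390625
    raw, = struct.unpack("<H", struct.pack("<e", x))  # half-precision bit pattern

    sign = raw >> 15               # 1 bit
    exponent = (raw >> 10) & 0x1F  # 5 bits, biased by 15
    mantissa = raw & 0x3FF         # 10 bits, leading 1 implicit

    print(f"{sign:01b} {exponent:05b} {mantissa:010b}")  # 0 01010 0100000000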

How do I round this binary number to the nearest even

I have this binary representation of 0.1:
0.00011001100110011001100110011001100110011001100110011001100110
I need to round it to the nearest even to be able to store it in double-precision floating point. I can't seem to understand how to do that. Most tutorials talk about guard, round, and sticky bits - where are they in this representation?
Also I've found the following explanation:
Let’s see what 0.1 looks like in double-precision. First, let’s write
it in binary, truncated to 57 significant bits:
0.000110011001100110011001100110011001100110011001100110011001…
Bits 54 and beyond total to greater than half the value of bit
position 53, so this rounds up to
0.0001100110011001100110011001100110011001100110011001101
This one doesn't talk about GRS bits, why? Aren't they always required?
The text you quote is from my article Why 0.1 Does Not Exist In Floating-Point. In that article I am showing how to do the conversion by hand, and the "GRS" bits are an IEEE implementation detail. Even if you are using a computer to do the conversion, you don't have to use IEEE arithmetic (and you shouldn't if you want to do it correctly), so the GRS bits won't come into play there either. In any case, the GRS bits apply to calculations, not really to the conceptual idea of conversion.
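To make the hand conversion concrete, here is a small Python sketch of round-half-to-even using exact rational arithmetic; the function name and structure are my own, not from the article:

    from fractions import Fraction

    def round_to_sig_bits(x, sig_bits):
        """Round a positive rational to sig_bits significant bits,
        breaking ties toward an even last bit (round half to even)."""
        e = 0
        # Scale x into [2**(sig_bits - 1), 2**sig_bits) so int(x) keeps
        # exactly sig_bits bits.
        while x >= 2 ** sig_bits:
            x /= 2
            e += 1
        while x < 2 ** (sig_bits - 1):
            x *= 2
            e -= 1
        n = int(x)    # truncated significand
        tail = x - n  # the discarded bits, as a fraction of one last-place unit
        if tail > Fraction(1, 2) or (tail == Fraction(1, 2) and n % 2 == 1):
            n += 1    # round up; exact ties go to the even significand
        return Fraction(n) * Fraction(2) ** e

    # 0.1 rounded to the 53 significant bits of an IEEE double:
    print(round_to_sig_bits(Fraction(1, 10), 53) == Fraction(0.1))  # True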

Need Help Understanding an 8-Bit Signed Decimal with 2's Complement

I need help in determining if my logic here is right or wrong.
Example Question
"Assuming I have an 8-bit signed decimal value of 200 in two's compliment form..."
My Thought Process
Now because it is 8-bits and is signed, the most significant bit must be reserved for the sign.
Thus, the maximum positive value it can have is:
2^(8-1) - 1 = 127
At first I was confused: why does the question state that 200 can be 8 bits and signed? Then I realized that's where the two's complement statement comes into play.
Because it is two's compliment in reality, this is the case:
8-bit Signed, 2's Complement, Decimal = 200
Convert to Binary --> 1100 1000
Because it is signed, the two's complement value is actually -56 (I would negate by inverting the 1s and 0s and adding 1, but in the interest of time, I just used an online converter).
So my conclusion is:
8-bit Signed, 2's Complement, Decimal value of 200 is actually -56.
Ultimate Question
Is my thought process correct with this? If so, I think the most confusing part about this is telling my brain that one number is equal to a completely different number.
Yes, I think your analysis is correct.
To expand a bit more, I think the wording of the question is awkward and would have been better stated as "What is the value of 1100 1000 in base 10, where the number is a two's complement number?"
The trick here is to think not that 200 == -56, but that the single point of truth is the bits 11001000. These bits have no meaning by themselves; the computer interprets them differently depending on the program. So an 8-bit two's complement interpretation treats them as -56, an unsigned interpretation treats them as 200, and an 8-bit character encoding would treat them as some character, depending on the encoding (plain ASCII only covers 0-127).
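As a minimal Python illustration of that idea (the names are my own), here is the same bit pattern read three ways:

    bits = 0b11001000   # the single point of truth: the raw bit pattern

    unsigned = bits                                   # unsigned: 200
    signed = bits - 256 if bits & 0x80 else bits      # 8-bit two's complement: -56
    char = bits.to_bytes(1, "big").decode("latin-1")  # Latin-1 character: 'È'

    print(unsigned, signed, char)  # 200 -56 È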

Do we ignore overflow in Two's Complement

I'm trying to wrap my head around overflow in two's complement. For example, say I'm trying to subtract these two binary numbers:
1111 1000 0100 - 0101 1100 1000
I convert the 2nd binary number to its two's complement equivalent and then simply add it, but I noticed it resulted in an overflow of 1. Do I simply ignore the overflow, or is there a rule I must follow?
1111 1000 0100 + 1010 0011 1000 = (1) 1001 1011 1100
Short answer:
if you are performing arithmetic on fixed-width binary numbers, using two's complement representation for negative numbers, then yes, you ignore the one-bit overflow.
Long Answer:
You can consider each ith bit in n-bit two's complement notation as having place value 2^i, for 0 <= i < n - 1, with bit n - 1 (the sign bit) having place value -2^(n - 1). That's a negative place value for the sign bit. If you compute the sum of two such numbers as if they were unsigned n-bit binary numbers, these cases are fine:
the sign bit is not set in either addend or in the result (reinterpreted as being in two's-complement representation),
the sign bit is set in exactly one of the addends, regardless of overflow (which is ignored if it occurs), or
the sign bit is set in both addends (therefore there is an overflow, which is ignored) and in the result.
To understand that, it may be easier to think about the problem as two separate sums: a sum of the sign bits, and a sum of the value (other) bits. An overflow of the value sum yields an overflow bit whose place value is 2^(n-1) -- exactly the negative of the sign bit's place value -- therefore such an overflow cancels one sign bit.
The negative + negative case requires such a cancellation for the result to be representable (two sign bits + one value overflow = one sign bit), and the positive + positive case cannot accommodate such a cancellation because there is no sign bit available to be cancelled. In the positive + negative case, there is an overflow of the value-bit sum in exactly those cases where the result is non-negative; you can consider that to cancel the sign bit of the negative addend, which yields the same result as ignoring the overflow of the overall unsigned sum, and reinterpreting the sum as a two's complement number.
The remaining cases yield mathematical results that cannot be represented in n-bit two's complement format -- either greater than the largest representable number, or less than the smallest. If you ignore overflow then such results can be recognized by an apparent sign flip. What you do with that is a question of error recovery strategy.
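If a concrete model helps, here is a short Python sketch (the names are mine) that does the unsigned add, masks away the overflow bit, and flags the unrepresentable cases by the apparent sign flip described above:

    def add_n_bit(a, b, n=12):
        """Add two n-bit two's complement bit patterns as unsigned integers,
        discarding the carry out and reporting true overflow by the
        sign-flip rule."""
        mask = (1 << n) - 1
        total = (a + b) & mask   # the (n+1)th bit is simply dropped
        sign = 1 << (n - 1)
        overflow = (a & sign) == (b & sign) and (a & sign) != (total & sign)
        return total, overflow

    # The question's sum: 1111 1000 0100 + 1010 0011 1000
    result, overflow = add_n_bit(0b111110000100, 0b101000111000)
    print(f"{result:012b} overflow={overflow}")  # 100110111100 overflow=False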
From Wikipedia's article on 2's complement, in the section on addition at https://en.wikipedia.org/wiki/Two%27s_complement#Addition, my understanding is that a carry beyond the given (fixed) bit length (to the left) can be ignored, but overflow cannot; overflow has occurred when the leftmost two bits of the carry row are different. The article shows how to maintain a carry row so as to tell whether there was overflow, and here is a simple example in the same style:
In 4-bit 2's complement, -2 is 1110 and +3 is 0011, so
 11100  carry row
  1110  -2
+ 0011  +3
 -----
 10001  which is 0001, or simply 1, ignoring the carry into bit 5, and is
        safe since the leftmost two bits in the carry row are identical
Although this is a very old question, it comes up frequently. In two's complement addition, a carry out from the leftmost digit is discarded. Why? Although not precisely correct mathematically, it is easiest to think of a two's complement number as having a sign bit on the left and value bits elsewhere. A carry out of the sign bit occurs when both addends are negative, or when addends of opposite sign have a carry into the sign bit; in every case where the result is representable, the result's sign bit comes out correct, so the carry out adds no information and can be dropped.
A problem occurs if the carry into the sign bit is different from the carry out. That causes an incorrect sign bit, which is an overflow condition. It can be detected without referring to the carry out from the sign bit, because the sign of the result will be wrong. For example, if two positive numbers are added and the result is negative, something is wrong. The something that's wrong is that the sum of the value bits has overflowed into the sign bit, and the result is in error.
With pen-and-paper arithmetic, it is usual to discard the carry and check that the sign of the result is correct. In electronic circuits, the easiest way is to compare the carry into the sign bit with the carry out of it using an XOR, and signal an error if they differ. The carry out is not otherwise used or stored.
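Here is a small Python sketch of that circuit-style check (the function name is my own), extracting the carry into and out of the sign bit and comparing them:

    def overflow_by_carries(a, b, n=8):
        """Report two's complement overflow by comparing the carry into
        the sign bit with the carry out of it, as an adder circuit would."""
        low_mask = (1 << (n - 1)) - 1
        carry_in = ((a & low_mask) + (b & low_mask)) >> (n - 1)
        carry_out = (a + b) >> n
        return carry_in != carry_out   # XOR of the two carries

    print(overflow_by_carries(0b01111111, 0b00000001))  # True: 127 + 1 overflows
    print(overflow_by_carries(0b11111111, 0b00000001))  # False: -1 + 1 = 0 is fine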

Distinguishing between signed binary values

Take the binary representation of 8 decimal: 0000 1000. Using two's complement, find the opposite by switching all the bits and adding one: 1111 1000. Now we have a binary representation for -8 decimal.
But how do we know whether to interpret this in decimal as -8 or 248?
When somebody writes down a binary number they usually specify whether it's signed or unsigned. If they don't specify anything you can assume that it's unsigned, i.e. 248 in this case.
The sign bit (the far-left, most significant bit) is 1; that means it's a negative number.
And if you have 8 bits, you can only get -128 to 127 (those 256 different values). So the highest positive number is 0111 1111; you can't get above 127. So that's how you know.
Note: that far-left bit is still called a sign bit even though it's not the sign bit of sign-and-magnitude representation. It holds value besides the sign, but it does show the sign.
From Wikipedia's article on two's complement:
"The most significant bit determines the sign of the number and is sometimes called the sign bit. Unlike in sign-and-magnitude representation, the sign bit also has the weight..."
On a slight tangent, I'd add a shorthand for quickly taking the two's complement: starting from the right, keep everything up to and including the first 1, then flip all the bits to its left. For example, to negate 0101, hold the 1 on the far right and flip the rest, so 0101 becomes 1011. Notice that this matches the long way: 0101 inverted is 1010, and adding 1 gives 1011. It works in both directions, too: from 1011 you can subtract one and invert, or invert and add one, or apply the same hold-and-flip shortcut, and each way you get back 0101. Another example: for 0110, hold the 10 on the far right, flip the bits to its left, and you get 1010.
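Here is that shortcut as a small Python sketch (my own function name), operating on a bit string:

    def negate_shortcut(bits):
        """Two's complement negation: keep everything up to and including
        the lowest 1, flip every bit to its left."""
        i = bits.rfind("1")   # position of the lowest set bit
        if i == -1:
            return bits       # negating zero gives zero
        flipped = "".join("1" if c == "0" else "0" for c in bits[:i])
        return flipped + bits[i:]

    print(negate_shortcut("0101"))  # 1011
    print(negate_shortcut("0110"))  # 1010
    print(negate_shortcut("1011"))  # 0101 (applying it again gets you back)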
You could ask: how do you know it's in two's complement, in the sense of how do you know what format the number is stored in, whether two's complement, one's complement, sign-magnitude, or floating point? Well, you have to know, because you stored it! You can't store data and not remember what the data means.

How to represent floating-point numbers in binary?

I have been working on these three lab questions for about 5 hours. I'm stuck on the last question.
Consider the following floating point number representation which
stores a floating point number in 16 bits. You have a sign-bit, a six
bit (excess-32) exponent, and a nine bit mantissa.
Explain how the 9-bit mantissa might get you into trouble.
Here is the preceding question. Not sure if it will help in analysis.
What is the range of exponents it supports?
000000 to 111111 or 0 to 63 where exponent values less
than 32 are negative, and exponent values greater than 32 are
positive.
I have a pretty good foundation for floating points and converting between decimals and floating points. Any guidance would be greatly appreciated.
To me, the ratio of mantissa to exponent is a bit off. Even if we assume there is a hidden bit, effectively making this a 10-bit mantissa (with the top bit always set), you can represent values up to about + or - 2^31, but in 2^31/2^10 = 2^21 steps (i.e. steps of 2097152).
I'd rather use an 11-bit mantissa and a 5-bit exponent, making this 2^15/2^11 = 2^4, i.e. steps of 16.
So for me the trouble would be that 9+1 bits of precision is simply too little, compared to the relatively large exponent range.
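Just to reproduce the arithmetic above in runnable form (this is the answer's rough ratio of range to distinct significand values, not a formal last-place-unit derivation):

    # Spacing between representable values near the top of the range:
    print(2**31 // 2**10)  # lab format, 9+1 effective mantissa bits: 2097152
    print(2**15 // 2**11)  # proposed 5-bit exponent / 11-bit mantissa: 16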
My guess is that a nine-bit mantissa simply provides far too little precision, so that any operation, apart from trivial ones, will make the calculation far too inexact to be useful.
I admit this answer is a little bit far-fetched, but apart from this, I can't see a problem with the representation.