I'm in a basic Engineering class and we're going through binary conversions. I can figure out the base 10 to binary or hex conversions really well; however, the 8-bit floating point conversions are kicking my ass, and I can't find anything online that breaks them down at a n00b level and shows the steps. Have any gurus found anything online that would be helpful for this situation?
I have questions like: 00101010 (8bfp) = what number in base 10?
Whenever I want to remember how floating point works, I refer back to the wikipedia page on 32 bit floats. I think it lays out the concepts pretty well.
http://en.wikipedia.org/wiki/Single_precision_floating-point_format
Note that Wikipedia doesn't know what 8-bit floats are; I think your professor may have invented them ;)
Binary floating point formats are usually broken down into 3 fields: sign bit, exponent, and mantissa.
The sign bit is simply set to 1 if the entire number should be negative, and 0 if the number is positive.
The exponent is usually an unsigned int with an offset, where 2 to the 0th power (1) sits in the middle of the range. It's simpler in hardware and software to compare magnitudes this way.
The mantissa works like the mantissa in regular scientific notation, with one caveat: the most significant bit is hidden. This comes from the requirement that normalized scientific notation have exactly one significant digit above the decimal point. Remember when your math teacher in elementary school would whack your knuckles with a ruler for writing 35.648 x 10^6 or 0.35648 x 10^8 instead of the correct 3.5648 x 10^7? Since binary only has two states, the required digit above the point is always 1, so leaving it out buys another bit of accuracy at the low end of the mantissa.
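As an illustration, here's how decoding might look in C# if the class format is 1 sign bit, 4 exponent bits, and 3 mantissa bits, with an excess-7 exponent and a hidden leading 1. That layout and bias are my guesses, so check them against your course notes before relying on the numbers.

static double DecodeMiniFloat(byte bits)
{
    // Assumed layout: [sign:1][exponent:4][mantissa:3], bias 7, hidden leading 1.
    // (Zeros, subnormals, and infinities/NaNs are ignored for brevity.)
    int sign     = (bits >> 7) & 0x1;
    int exponent = (bits >> 3) & 0xF;
    int mantissa =  bits       & 0x7;

    // value = (-1)^sign * 1.mmm (binary) * 2^(exponent - 7)
    double significand = 1.0 + mantissa / 8.0;
    double value = significand * Math.Pow(2, exponent - 7);
    return sign == 1 ? -value : value;
}

Under those assumed parameters, 00101010 splits into sign 0, exponent 0101 = 5, mantissa 010, giving 1.25 x 2^(5-7) = 0.3125.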
Related
For some backstory, I'm making a program that can do arithmetic on ones' complement numbers. To do this I'm converting a binary string into a BigInteger, performing the math on those BigIntegers, and then converting the result back into a binary string. The only problem occurs when the end result goes below -127 or above +127, because I don't know how to correct it, due to the nature of ones' complement numbers. I was hoping I could instead convert them like unsigned numbers and do what this answer says.
There are also a couple of other questions that came out of reading the linked question. I put them in block quotes. I'm just asking for an explanation of what they mean.
Firstly
I know that the (r-1)'s complement for a base-r number should do an end-around carry if the highest bit has a carry.
Secondly
End-around carry is actually rather simple: it changes the modulus of the addition operation from r^n to r^n - 1.
And lastly
Again, let's keep the carry bit where it is. If you look at the numbers as unsigned integers, we're computing 13 + 11 = 24. However, due to the wrap-around carry, addition is done modulo 15, so we end up with 9, which represents -6 (the correct result).
If someone can explain these quotes to me and provide some web pages for me to read I would greatly appreciate it! :)
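To make the last quote concrete, here's a small sketch of what I think it's describing, for 4-bit ones' complement (C#; the helper name is mine):

static int AddOnesComplement4(int a, int b)
{
    // Add as plain unsigned 4-bit values.
    int sum = a + b;

    // If the addition carried out of bit 3, wrap the carry back into bit 0.
    // This is what makes the arithmetic work modulo 2^4 - 1 = 15.
    if (sum > 0xF)
        sum = (sum & 0xF) + 1;

    return sum;  // note: 1111 (negative zero) can still come out of this
}

Using the example from the quote: 1101 (-2) plus 1011 (-4) is 24 as an unsigned sum, which wraps to 9 = 1001, i.e. -6 in 4-bit ones' complement.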
I have this binary representation of 0.1:
0.00011001100110011001100110011001100110011001100110011001100110
I need to round it to the nearest even to be able to store it in the double precision floating point. I can't seem to understand how to do that. Most tutorials talk about guard, round and sticky bits - where are they in this representation?
Also I've found the following explanation:
Let’s see what 0.1 looks like in double-precision. First, let’s write
it in binary, truncated to 57 significant bits:
0.000110011001100110011001100110011001100110011001100110011001…
Bits 54 and beyond total to greater than half the value of bit
position 53, so this rounds up to
0.0001100110011001100110011001100110011001100110011001101
This one doesn't talk about GRS bits. Why? Aren't they always required?
The text you quote is from my article Why 0.1 Does Not Exist In Floating-Point. In that article I am showing how to do the conversion by hand, and the "GRS" bits are an IEEE implementation detail. Even if you are using a computer to do the conversion, you don't have to use IEEE arithmetic (and you shouldn't if you want to do it correctly), so the GRS bits won't come into play there either. In any case, the GRS bits apply to calculations, not really to the conceptual idea of conversion.
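If you want to check the hand conversion against what a machine actually stores, one quick way (shown in C# here, but any language that exposes the raw bits will do):

// 0.1 stored as an IEEE 754 double; prints 3FB999999999999A.
// The trailing ...A (binary ...1010) is the rounded-up last bit;
// plain truncation would have ended the fraction field in ...9 (binary ...1001).
Console.WriteLine(BitConverter.DoubleToInt64Bits(0.1).ToString("X16"));

The fraction field 999999999999A matches the rounded value 0.0001100110011001100110011001100110011001100110011001101 above.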
How do I write 0xFA in signed mantissa? I converted it to binary: 1111_1010. Not sure where to go from here.
The question is "If the register file has 8 bits width total, write the following in signed mantissa."
Also, an explanation of signed mantissa would be great!
So what you have to work with is a byte of data with an unknown type, apparently.
In order to write a number in signed mantissa (see Significand) one would expect that you're dealing with a floating point type such as single or double. However, you've only got a single byte.
A single is 4 bytes (32 bits), so surely it can't be that, and a double is double trouble. A half still requires 16 bits. The only logical alternative type would be SByte, but in that case you will never get any numbers with a mantissa (significant digits) after the decimal point. In fact there is no decimal point. So perhaps this is a trick question?
If you go on the assumption of SByte, you get -6x10^0
Just in case you want proof, or if you're curious how this looks during debug:
private void SByte2Dec()
{
    // With fromBase 16, Convert.ToSByte treats the high-order bit as the
    // sign bit (two's complement), so "0xFA" parses as -6.
    sbyte convertsHexToSByte = Convert.ToSByte("0xFA", 16);
    Single yourAnswer = Convert.ToSingle(convertsHexToSByte);
    label1.Text = Convert.ToString(yourAnswer);  // displays "-6"
}
In this example I had a windows form with nothing but label1 on it.
Then I put SByte2Dec(); right under InitializeComponent();
The solution is -122. Not sure how to get there...any ideas?
Working backwards from the answer, it's simple to see what your professor has done. He is assuming the MSB is the sign bit and the rest is treated like a 7-bit integer. There is a precedent for this, called "signed magnitude representation", but it's rarely used in modern computing. These days pretty much everyone is using two's complement.
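A quick sketch of that reading of the byte, next to the two's complement one (C#; variable names are mine):

byte raw = 0xFA;  // 1111_1010

// Signed magnitude: MSB is the sign, the low 7 bits are the magnitude.
int magnitude = raw & 0x7F;                                        // 111_1010 = 122
int signedMagnitude = (raw & 0x80) != 0 ? -magnitude : magnitude;  // -122

// Two's complement, for comparison:
sbyte twosComplement = unchecked((sbyte)raw);                      // -6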
I take it this is a beginners' course, and rather than go through all the trouble of explaining two's complement and data types, your professor is mainly trying to drive home the point of the MSB being a sign bit. If you got the whole sign bit thing and don't know anything else about the way modern computer hardware performs calculations, then you would probably arrive at the same answer.
My guess is that your professor also took to wording the question in a strange way so as to throw you off the path if you tried to Google the answer. If you want to get him back, ask him what the difference between "1000 0000" and "0000 0000" is. Also, if you or anyone else in the class answered -6 and he counted it wrong, he should be fired. Those students should be awarded bonus points for teaching themselves about two's complement.
Why would the signed mantissa be -6? I see that the 2's complement is -6 but signed mantissa is different?
Have you read the wiki article I linked to on "Significand"?
The important thing to realize is that "signed mantissa" is not a data type. However, there are (were?) many machine-specific data types that implemented their own ways of storing floating point numbers before the IEEE standard became widely adopted. These early data types were often referred to as DFP, or decimal floating point numbers, as opposed to binary floating point. Read this paper for a more in-depth understanding. This paper also covers the topic quite well.
As I stated earlier, your professor most likely used the terminology "signed mantissa" to throw you off if you went searching the internet for the answer. Apparently you were expected to read between the lines and know that what he was really asking for was a form of decimal floating point, or Signed Magnitude Representation.
"Signed Mantissa" ≠ Two's Compliment
"Signed Mantissa" is to be interpreted as some form of Decimal Floating Point
Where as, Two's Compliment is a form of Binary Floating Point
I am writing a program for an embedded hardware that only supports 32-bit single-precision floating-point arithmetic. The algorithm I am implementing, however, requires a 64-bit double-precision addition and comparison. I am trying to emulate double datatype using a tuple of two floats. So a double d will be emulated as a struct containing the tuple: (float d.hi, float d.low).
The comparison should be straightforward using a lexicographic ordering. The addition, however, is a bit tricky, because I am not sure which base I should use. Should it be FLT_MAX? And how can I detect a carry?
How can this be done?
Edit (Clarity): I need the extra significant digits rather than the extra range.
double-float is a technique that uses pairs of single-precision numbers to achieve almost twice the precision of single-precision arithmetic, accompanied by a slight reduction of the single-precision exponent range (due to intermediate underflow and overflow at the far ends of the range). The basic algorithms were developed by T.J. Dekker and William Kahan in the 1970s. Below I list two fairly recent papers that show how these techniques can be adapted to GPUs; however, much of the material covered in these papers is applicable independent of platform, so it should be useful for the task at hand.
https://hal.archives-ouvertes.fr/hal-00021443
Guillaume Da Graça, David Defour
Implementation of float-float operators on graphics hardware,
7th conference on Real Numbers and Computers, RNC7.
http://andrewthall.org/papers/df64_qf128.pdf
Andrew Thall
Extended-Precision Floating-Point Numbers for GPU Computation.
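As a rough starting point, the basic building block in those papers is the error-free transformation usually called TwoSum (due to Knuth/Møller): it returns the rounded sum plus the exact rounding error, and that error becomes the low word of a double-float result. A minimal sketch in C#; note it assumes every float operation rounds to single precision with no excess intermediate precision, which you should verify on your target:

static (float hi, float lo) TwoSum(float a, float b)
{
    // s is the rounded sum; err recovers exactly what rounding discarded,
    // so a + b == s + err holds exactly.
    float s   = a + b;
    float bv  = s - a;
    float err = (a - (s - bv)) + (b - bv);
    return (s, err);
}

Adding two double-floats then combines the high and low words with TwoSum and renormalizes; see the Da Graça/Defour paper for the full operation sequence.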
This is not going to be simple.
A float (IEEE 754 single-precision) has 1 sign bit, 8 exponent bits, and 23 bits of mantissa (well, effectively 24).
A double (IEEE 754 double-precision) has 1 sign bit, 11 exponent bits, and 52 bits of mantissa (effectively 53).
You can use the sign bit and 8 exponent bits from one of your floats, but how are you going to get 3 more exponent bits and 29 bits of mantissa out of the other?
Maybe somebody else can come up with something clever, but my answer is "this is impossible". (Or at least, "no easier than using a 64-bit struct and implementing your own operations")
It depends a bit on what types of operations you want to perform. If you only care about additions and subtractions, Kahan Summation can be a great solution.
If you need both the precision and a wide range, you'll be needing a software implementation of double precision floating point, such as SoftFloat.
(For addition, the basic principle is to break the representation (e.g. 64 bits) of each value into its three constituent parts - sign, exponent and mantissa; then shift the mantissa of one part based on the difference in the exponents, add to or subtract from the mantissa of the other part based on the sign bits, and possibly renormalise the result by shifting the mantissa and adjusting the exponent correspondingly. Along the way, there are a lot of fiddly details to account for, in order to avoid unnecessary loss of accuracy, and to deal with special values such as infinities, NaNs, and denormalised numbers.)
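For the Kahan Summation route mentioned above, a minimal sketch of the standard algorithm (C#; variable names are mine):

static float KahanSum(float[] values)
{
    float sum = 0.0f;
    float c   = 0.0f;        // running compensation for lost low-order bits

    foreach (float x in values)
    {
        float y = x - c;     // apply the correction from the previous step
        float t = sum + y;   // big + small: low-order bits of y get lost here...
        c = (t - sum) - y;   // ...and recovered here (algebraically zero)
        sum = t;
    }
    return sum;
}

The usual caveat applies: the compiler must not reassociate these operations or keep intermediates at higher precision, or the compensation cancels out.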
Given all the constraints for high precision over 23 magnitudes, I think the most fruitful method would be to implement a custom arithmetic package.
A quick survey shows Briggs' doubledouble C++ library should address your needs and then some. See this.[*] The default implementation is based on double to achieve 30 significant figure computation, but it is readily rewritten to use float to achieve 13 or 14 significant figures. That may be enough for your requirements if care is taken to segregate addition operations with similar magnitude values, only adding extremes together in the last operations.
Beware though, the comments mention messing around with the x87 control register. I didn't check into the details, but that might make the code too non-portable for your use.
[*] The C++ source is linked from that article; the gzipped tar was the only link that wasn't dead.
This is similar to the double-double arithmetic used by many compilers for long double on some machines that have only hardware double calculation support. It's also used as float-float on older NVIDIA GPUs where there's no double support. See Emulating FP64 with 2 FP32 on a GPU. This way the calculation will be much faster than a software floating-point library.
However, most microcontrollers have no hardware support for floats, so floats are implemented purely in software there. Because of that, using float-float may not increase performance, and it introduces some memory overhead for storing the extra exponent bytes.
If you really need the longer mantissa, try using a custom floating-point library. You can choose whatever is enough for you, for example adapting the library to a new 48-bit float type of your own if only 40 bits of mantissa and 7 bits of exponent are needed, so no time is spent calculating and storing the unnecessary 16 bits. But such a library needs to be very efficient, because compilers' own libraries often have assembly-level optimization for their native float types.
Another software-based solution that might be of use: GNU MPFR
It takes care of many special cases that you would otherwise have to handle yourself, and it allows arbitrary precision (beyond 64-bit double).
That's not practical. If it were, every embedded 32-bit processor (or compiler) would emulate double precision by doing that. As it stands, none that I am aware of do. Most of them just substitute float for double.
If you need the precision and not the dynamic range, your best bet would be to use fixed point. If the compiler supports 64-bit integers, this will be easier too.
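To illustrate the fixed point suggestion: with 64-bit integers available, the addition and comparison the question asks for become plain integer operations. A rough sketch, where the Q32.32 split is an arbitrary choice of mine:

// Q32.32 fixed point: the top 32 bits are the integer part,
// the bottom 32 bits are the fraction.
long a = (long)(1.5  * (1L << 32));      // 1.5
long b = (long)(0.25 * (1L << 32));      // 0.25

long sum    = a + b;                     // exact, ordinary integer add
bool aLess  = a < b;                     // ordinary integer compare
double back = sum / (double)(1L << 32);  // 1.75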
I have been working on these three lab questions for about 5 hours. I'm stuck on the last question.
Consider the following floating point number representation which stores a floating point number in 16 bits. You have a sign bit, a six-bit (excess-32) exponent, and a nine-bit mantissa.
Explain how the 9-bit mantissa might get you into trouble.
Here is the preceding question. Not sure if it will help in analysis.
What is the range of exponents it supports?
000000 to 111111 or 0 to 63 where exponent values less
than 32 are negative, and exponent values greater than 32 are
positive.
I have a pretty good foundation for floating points and converting between decimals and floating points. Any guidance would be greatly appreciated.
To me, the ratio of mantissa to exponent is a bit off. Even if we assume there is a hidden bit, effectively making this a 10-bit mantissa (with the top bit always set), you can represent up to about ±2^31, but near the top of that range neighbouring values are 2^31/2^10 = 2^21 apart (i.e. steps of 2097152).
I'd rather use an 11-bit mantissa and a 5-bit exponent, making this 2^15/2^11 = 2^4, i.e. steps of 16.
So for me the trouble would be that 9+1 bits is simply too imprecise compared to the relatively large exponent range.
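A quick way to sanity-check those step sizes (C#; the arguments are the maximum unbiased exponent and the effective mantissa width):

static double Gap(int maxExponent, int effectiveMantissaBits)
{
    // Spacing between neighbouring representable values just below 2^maxExponent.
    return Math.Pow(2, maxExponent - effectiveMantissaBits);
}

// Gap(31, 10) -> 2097152  (6-bit excess-32 exponent, 9-bit mantissa + hidden bit)
// Gap(15, 11) -> 16       (the suggested 5-bit exponent, 11-bit mantissa)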
My guess is that a nine-bit mantissa simply provides far too little precision, so that any operation apart from trivial ones will make the calculation far too inexact to be useful.
I admit this answer is a little bit far-fetched, but apart from this, I can't see a problem with the representation.