Handling Double values on CUDA (Compute Capability 1.1) [duplicate] - cuda

I am writing a program for embedded hardware that only supports 32-bit single-precision floating-point arithmetic. The algorithm I am implementing, however, requires 64-bit double-precision addition and comparison. I am trying to emulate the double datatype using a tuple of two floats. So a double d will be emulated as a struct containing the tuple: (float d.hi, float d.low).
The comparison should be straightforward using a lexicographic ordering. The addition, however, is a bit tricky because I am not sure which base I should use. Should it be FLT_MAX? And how can I detect a carry?
How can this be done?
Edit (Clarity): I need the extra significant digits rather than the extra range.

double-float is a technique that uses pairs of single-precision numbers to achieve almost twice the precision of single-precision arithmetic, at the cost of a slight reduction of the exponent range (due to intermediate underflow and overflow at the far ends of the range). The basic algorithms were developed by T. J. Dekker and William Kahan in the 1970s. Below I list two fairly recent papers that show how these techniques can be adapted to GPUs; however, much of the material covered in these papers is applicable independent of platform, so it should be useful for the task at hand. A small code sketch follows the references below.
https://hal.archives-ouvertes.fr/hal-00021443
Guillaume Da Graça, David Defour
Implementation of float-float operators on graphics hardware,
7th conference on Real Numbers and Computers, RNC7.
http://andrewthall.org/papers/df64_qf128.pdf
Andrew Thall
Extended-Precision Floating-Point Numbers for GPU Computation.
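To make the core technique concrete, here is a minimal C++ sketch of the two error-free building blocks (Knuth's TwoSum and Dekker's Fast TwoSum) and a double-float addition built from them, roughly following the float-float operators described in the papers above. The names (two_sum, df_add, etc.) are mine, and the code assumes strict IEEE 754 single-precision arithmetic (compile without fast-math or contraction, e.g. no -ffast-math):

#include <cstdio>

struct dfloat { float hi, lo; };  // value = hi + lo, with |lo| <= 0.5 ulp(hi)

// Knuth's TwoSum: s + err == a + b exactly, for any a and b.
static void two_sum(float a, float b, float &s, float &err) {
    s = a + b;
    float bb = s - a;
    err = (a - (s - bb)) + (b - bb);
}

// Dekker's Fast TwoSum: same result, but requires |a| >= |b|.
static void quick_two_sum(float a, float b, float &s, float &err) {
    s = a + b;
    err = b - (s - a);
}

// Double-float addition: sum the high parts exactly, fold the error
// and the low parts in, then renormalize so (hi, lo) is canonical.
// Once normalized, comparison is indeed lexicographic on (hi, lo).
static dfloat df_add(dfloat a, dfloat b) {
    float s, e;
    two_sum(a.hi, b.hi, s, e);
    e += a.lo + b.lo;
    dfloat r;
    quick_two_sum(s, e, r.hi, r.lo);
    return r;
}

int main() {
    dfloat a = {1.0f, 0.0f};
    dfloat b = {1e-10f, 0.0f};  // would vanish entirely in plain float addition
    dfloat c = df_add(a, b);
    printf("hi = %g, lo = %g\n", c.hi, c.lo);  // hi = 1, lo = 1e-10
}

Note that there is no explicit "base" or carry detection as with integer bignums; the carry is captured implicitly as the exact rounding error of each float addition.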

This is not going to be simple.
A float (IEEE 754 single-precision) has 1 sign bit, 8 exponent bits, and 23 bits of mantissa (well, effectively 24).
A double (IEEE 754 double-precision) has 1 sign bit, 11 exponent bits, and 52 bits of mantissa (effectively 53).
You can use the sign bit and 8 exponent bits from one of your floats, but how are you going to get 3 more exponent bits and 29 bits of mantissa out of the other?
Maybe somebody else can come up with something clever, but my answer is "this is impossible". (Or at least, "no easier than using a 64-bit struct and implementing your own operations")

It depends a bit on what types of operations you want to perform. If you only care about additions and subtractions, Kahan Summation can be a great solution.
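A minimal sketch in C++, assuming the compiler is not allowed to reassociate floating-point operations (no fast-math), since the whole trick lives in the rounding error:

#include <cstdio>

// Kahan compensated summation: 'c' accumulates the low-order bits
// that each individual addition would otherwise throw away.
float kahan_sum(const float *x, int n) {
    float sum = 0.0f, c = 0.0f;
    for (int i = 0; i < n; ++i) {
        float y = x[i] - c;   // re-inject previously lost low bits
        float t = sum + y;    // big + small: low bits of y get lost...
        c = (t - sum) - y;    // ...and are recovered here
        sum = t;
    }
    return sum;
}

int main() {
    float x[10000];
    for (int i = 0; i < 10000; ++i) x[i] = 0.0001f;
    printf("%f\n", kahan_sum(x, 10000));  // much closer to 1.0 than a naive loop
}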

If you need both the precision and a wide range, you'll be needing a software implementation of double precision floating point, such as SoftFloat.
(For addition, the basic principle is to break the representation (e.g. 64 bits) of each value into its three constituent parts - sign, exponent and mantissa; then shift the mantissa of one part based on the difference in the exponents, add to or subtract from the mantissa of the other part based on the sign bits, and possibly renormalise the result by shifting the mantissa and adjusting the exponent correspondingly. Along the way, there are a lot of fiddly details to account for, in order to avoid unnecessary loss of accuracy, and to deal with special values such as infinities, NaNs, and denormalised numbers.)
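To make that principle concrete, here is a deliberately simplified C++ sketch that adds two IEEE 754 doubles by manipulating their bit patterns; it assumes both inputs are positive, finite and normal, and it truncates rather than rounds. A production implementation like SoftFloat handles signs, all rounding modes, and the special values:

#include <cstdint>
#include <cstdio>
#include <cstring>

// Simplified software addition of two positive, normal doubles.
// No signs, no rounding, no infinities/NaNs/denormals.
double soft_add(double x, double y) {
    uint64_t a, b;
    memcpy(&a, &x, 8);
    memcpy(&b, &y, 8);
    // For positive doubles, bit-pattern order equals numeric order,
    // so this swap guarantees exponent(a) >= exponent(b).
    if (a < b) { uint64_t t = a; a = b; b = t; }

    int ea = (int)(a >> 52);                                // biased exponent
    int eb = (int)(b >> 52);
    uint64_t ma = (a & 0xFFFFFFFFFFFFFull) | (1ull << 52);  // restore hidden bit
    uint64_t mb = (b & 0xFFFFFFFFFFFFFull) | (1ull << 52);

    int shift = ea - eb;                    // align the smaller mantissa
    mb = (shift < 64) ? (mb >> shift) : 0;

    uint64_t m = ma + mb;
    if (m >> 53) { m >>= 1; ++ea; }         // renormalize on carry-out

    uint64_t r = ((uint64_t)ea << 52) | (m & 0xFFFFFFFFFFFFFull);
    double out;
    memcpy(&out, &r, 8);
    return out;
}

int main() {
    printf("%g\n", soft_add(1.5, 2.25));  // 3.75
}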

Given all the constraints for high precision over 23 orders of magnitude, I think the most fruitful method would be to implement a custom arithmetic package.
A quick survey shows Briggs' doubledouble C++ library should address your needs and then some. See this.[*] The default implementation is based on double to achieve 30 significant figures of computation, but it is readily rewritten to use float to achieve 13 or 14 significant figures. That may be enough for your requirements if care is taken to segregate addition operations on values of similar magnitude, only adding the extremes together in the last operations.
Beware, though: the comments mention messing around with the x87 control register. I didn't check into the details, but that might make the code too non-portable for your use.
[*] The C++ source is linked from that article; only the gzipped tar link was not dead.

This is similar to the double-double arithmetic used by many compilers for long double on machines that have hardware support only for double calculations. It's also used as float-float on older NVIDIA GPUs where there's no double support; see Emulating FP64 with 2 FP32 on a GPU. Done this way, the calculation will be much faster than with a software floating-point library.
However, most microcontrollers have no hardware support for floats, so floats are implemented purely in software. Because of that, using float-float may not increase performance, and it introduces some memory overhead for the redundant extra exponent bytes.
If you really need the longer mantissa, try using a custom floating-point library. You can choose whatever is enough for you; for example, adapt the library to a new 48-bit float type of your own if only 40 bits of mantissa and 7 bits of exponent are needed - no need to spend time calculating and storing the unnecessary 16 bits anymore. Such a library needs to be very efficient, though, because compilers' libraries often have assembly-level optimization for their own float types.

Another software-based solution that might be of use: GNU MPFR
It takes care of many special cases and allows arbitrary precision (better than 64-bit double), which you would otherwise have to take care of yourself.
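A minimal usage sketch of the MPFR C API (callable from C++ as well), assuming MPFR and GMP are installed; link with -lmpfr -lgmp:

#include <mpfr.h>

int main() {
    mpfr_t a, b, s;
    // 128 bits of mantissa, well beyond double's 53.
    mpfr_init2(a, 128);
    mpfr_init2(b, 128);
    mpfr_init2(s, 128);

    mpfr_set_str(a, "0.1", 10, MPFR_RNDN);  // from decimal strings
    mpfr_set_str(b, "0.2", 10, MPFR_RNDN);
    mpfr_add(s, a, b, MPFR_RNDN);           // every operation takes a rounding mode

    mpfr_printf("0.1 + 0.2 = %.40Rf\n", s);

    mpfr_clears(a, b, s, (mpfr_ptr) 0);
    return 0;
}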

That's not practical. If it were, every embedded 32-bit processor (or compiler) would emulate double precision that way. As it stands, none that I am aware of do. Most of them just substitute float for double.
If you need the precision and not the dynamic range, your best bet would be to use fixed point. If the compiler supports 64-bit integers, this will be easier too.
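For illustration, a minimal Q32.32 fixed-point sketch in C++ (assuming 64-bit integers are available); addition and comparison then reduce to exact integer operations, which is precisely what the question needs:

#include <cstdint>
#include <cstdio>

// Q32.32 fixed point: value = raw / 2^32.
typedef int64_t q32_32;
static const int64_t Q_ONE = (int64_t)1 << 32;

q32_32 q_from_double(double d) { return (q32_32)(d * (double)Q_ONE); }
double q_to_double(q32_32 q)   { return (double)q / (double)Q_ONE; }

int main() {
    q32_32 a = q_from_double(1.25);
    q32_32 b = q_from_double(2.50);
    q32_32 s = a + b;              // addition is exact integer addition
    printf("sum  = %.10f\n", q_to_double(s));
    printf("a<b  = %d\n", a < b);  // comparison is integer comparison
}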

Related

How best to represent negative-binary numbers for use in counters and comparators in Verilog HDL?

I have been designing a Delta-Sigma DAC and have run into confusion and despair over the handling of signed numbers in my (sigma) counters and (delta and Vref) comparators.
I have tried to employ signed 2's complement, but the EDA compiler doesn't seem to notice when I do it; it's most likely my own mistake!
So basically my question is: how (in Verilog) do I represent negative numbers in a way that they can be used in counters (which can therefore count up and down)? I am aware that a counter register that will hold signed numbers must be declared reg signed [<width>:0].
Thanks!
Gavin
Well, I am not that clear on your question. Some compilers may not be able to handle signed registers directly, since, if I recall correctly, it is a Verilog-2001 feature. But generally, if you use digital logic that works with signed numbers, you shouldn't have an issue. For example, if you use an adder IP, just specify that the inputs are signed numbers. As for the simulator, you can select the type of data you need, generally by selecting the register/value, right-clicking on the waveform window, and changing the type.
Finally, if you have to create the logic yourself, just use sign extension. So let's say you are working with 4-bit values extended to 8 bits:
-5 would be 11111011 and 5 would be 00000101. So you can see that for negative numbers the MSB is 1 and for positive numbers it's zero. Using this you can interpret the numbers in your code, but just make sure that the size is bigger than what you want to use so no overflow occurs.
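For what it's worth, here is the same sign-extension rule sketched in C++ for illustration (the question is about Verilog, but the bit manipulation is identical; in Verilog-2001, a signed reg or $signed() gets you this behaviour in hardware):

#include <cstdint>
#include <cstdio>

// Sign-extend a 4-bit two's-complement value to 8 bits by
// replicating the MSB (bit 3) into the upper four bits.
uint8_t sign_extend_4_to_8(uint8_t v4) {
    v4 &= 0x0F;                              // keep only the low 4 bits
    return (v4 & 0x08) ? (v4 | 0xF0) : v4;   // MSB set -> fill with ones
}

int main() {
    uint8_t m5 = sign_extend_4_to_8(0xB);    // 1011 = -5 in 4 bits
    uint8_t p5 = sign_extend_4_to_8(0x5);    // 0101 = +5 in 4 bits
    printf("-5 -> %02X (%d)\n", m5, (int8_t)m5);  // FB (-5)
    printf("+5 -> %02X (%d)\n", p5, (int8_t)p5);  // 05 (+5)
}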

Difference between double precision and full precision floating point

I am researching a possible GPU-based teraflop computing machine...
The benchmark to be used will be LINPACK.
Now here's the problem: going through the LINPACK documentation, it says that it calculates in full precision and not in double precision, and for some machines full precision can be single precision. Can someone please throw some light on the difference, as this will dictate whether I should go for the GTX 590s or the Tesla 2070s.
I think the term "full precision" was chosen to cover both IEEE-754 double precision (this is what is used on the GPUs mentioned) and the "single precision" format of old Cray vector computers, which sported 1 sign bit, 15 exponent bits, and 48 mantissa bits, providing a larger range but slightly less precision than IEEE-754 double precision. Here is documentation for the floating-point format used on the Cray-1:
http://ed-thelen.org/comp-hist/CRAY-1-HardRefMan/CRAY-1-HRM.html#p3-20
Concerning NVIDIA's official HPL version 0.8 (that's what we use to benchmark our hybrid machines):
It will run only on Teslas (it works only if your GPU has more than 2 GiB of memory, which, as far as I know, is true only for Tesla).
It uses double precision, so that's another point for the Teslas, since double-precision arithmetic performance is limited on mainstream GPUs.
BTW: achieving at least 50% efficiency on a 6-node machine (2 GPUs per node) is considered barely possible.

Fermi CUDA double precision against C

There is a small error between CPU and GPU double-precision results when using a Fermi GPU.
E.g. for a small test set, I get the following absolute error for (Number 1 (CPU) - Number 2 (GPU)): 3E-018.
In binary form it is, as expected, very small…
NUMBER 1 in binary:
xxxxxxxxxxxxx11100000001001
vs
NUMBER 2 in binary:
xxxxxxxxxxxx111100000001010
Although this is a difference of only one binary digit, I am keen to eliminate any differences, as the errors add up during my code.
Any tips from those familiar with Fermi? If this is unavoidable, can I get C/C++ to mimic the Fermi rounding behaviour?
You should take a look at this post.
Floating point is not associative, so if a compiler chooses to do operations in a different order then you'll get a different result. Two versions of the same compiler can produce differences! Different compilers are even more likely to produce differences, and if you're doing work in parallel on the GPU (you are, right?) then you're inherently doing operations in a different order...
Fermi hardware is IEEE 754-2008 compliant, which means that in addition to IEEE 754 standard rounding it also has the fused multiply-add (FMA) instruction, which avoids losing precision between the multiplication and the addition.
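A small C++ illustration of both effects; std::fma from <cmath> performs a fused multiply-add with a single rounding, which is what the Fermi FMA instruction does (whether your CPU compiler emits a hardware FMA for it is another variable to control for):

#include <cmath>
#include <cstdio>

int main() {
    // 1) Floating-point addition is not associative:
    double a = 1e16, b = -1e16, c = 1.0;
    printf("(a + b) + c = %g\n", (a + b) + c);  // 1
    printf("a + (b + c) = %g\n", a + (b + c));  // 0

    // 2) Fused multiply-add rounds once; multiply-then-add rounds twice:
    double x = 1.0 + std::ldexp(1.0, -30);       // 1 + 2^-30
    double naive = x * x - 1.0;                  // x*x is rounded first
    double fused = std::fma(x, x, -1.0);         // exact product, one rounding
    printf("naive = %.17e\nfused = %.17e\n", naive, fused);
}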

Real number arithmetic in a general purpose language?

As (hopefully) most of you know, floating-point arithmetic is different from real number arithmetic. For starters, it's imprecise. Many numbers, especially decimal fractions like 0.1 and 0.3, cannot be represented exactly, leading to problems like this. A more thorough list can be found here.
Are there any general purpose languages that have built-in support for something closer to real number arithmetic? If not, what are good libraries that support this?
EDIT: Arbitrary precision decimal datatypes are not what I am looking for. I want to be able to represent numbers like 1/3, sqrt(3), or 1 + 2i as well.
Though I hate to say it, Fortran. It has extensive support for arbitrary-precision arithmetic and tons of support for big-number calculations. It's ancient and gross, but it gets the job done.
All the numbers used in your examples are algebraic numbers, and can be represented finitely as roots of polynomials with integer coefficients.
The same cannot be said of real numbers in general, which is easily seen when one considers that the reals are uncountable, but the set of computer programs is countable. Therefore most reals will not have a finite representation in code.
What you are looking for is symbolic calculation (MATLAB and other tools used in math and engineering are good at it).
If you want a general-purpose language, I think expression trees in C# are a good point to start with. In essence, the ability to store an expression (instead of evaluating it into concrete values) is the key to being able to perform symbolic calculation. Note that expression trees do not provide symbolic calculation themselves; they just provide the data structure that supports it.
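A minimal C++ sketch of that idea (hypothetical types of my own, just to show the shape of it): the number 1 + sqrt(3) is stored as a tree rather than evaluated to a float, which is the prerequisite for any symbolic manipulation.

#include <cstdio>
#include <memory>

// A tiny expression tree: values are stored structurally, never
// evaluated to machine floats, so sqrt(3) stays exact.
struct Expr {
    enum Kind { Num, Add, Sqrt } kind;
    long value;                      // used when kind == Num
    std::shared_ptr<Expr> lhs, rhs;  // children for Add/Sqrt
};
typedef std::shared_ptr<Expr> ExprPtr;

ExprPtr num(long v)               { return std::make_shared<Expr>(Expr{Expr::Num, v, nullptr, nullptr}); }
ExprPtr add(ExprPtr a, ExprPtr b) { return std::make_shared<Expr>(Expr{Expr::Add, 0, a, b}); }
ExprPtr sqrt_(ExprPtr a)          { return std::make_shared<Expr>(Expr{Expr::Sqrt, 0, a, nullptr}); }

void print(const ExprPtr &e) {
    switch (e->kind) {
        case Expr::Num:  printf("%ld", e->value); break;
        case Expr::Add:  printf("("); print(e->lhs); printf(" + "); print(e->rhs); printf(")"); break;
        case Expr::Sqrt: printf("sqrt("); print(e->lhs); printf(")"); break;
    }
}

int main() {
    ExprPtr x = add(num(1), sqrt_(num(3)));  // 1 + sqrt(3), held exactly
    print(x);
    printf("\n");                            // prints (1 + sqrt(3))
}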
This question is interesting, but raises some issues. First, you will never be able to represent all the real numbers using a (even theoretically infinite) computer, for cardinality reasons.
What you are looking for is a "symbolic numbers" datatype. You can imagine some sort of expression tree, with predefined constants, arithmetical operations, and perhaps algebraic (roots of polynomials) and transcendantal (exp, sin, cos, log, etc) functions.
Now the fun part of the story: you cannot find an algorithm which tells whether two such trees are representing the same number (or equivalently, which test whether such a tree is zero). I won't state anything precise, but as a hint, this is similar to the Halting Problem (for computer scientists) or the Gödel Incompleteness Theorem (for mathematicians).
This renders such a class pretty useless.
For some subfields of the reals, you have canonical forms, like a/b for the rationals, or finite algebraic extensions of the rationals (a/b + i c/d for complex rationals, a/b + sqrt(2) * c/d for Q[sqrt(2)], etc.). These can be used to represent some particular sets of algebraic numbers. A sketch of the idea follows below.
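As a concrete illustration of such a canonical form, here is a sketch of exact arithmetic in Q[sqrt(2)] with integer coefficients (rational coefficients work the same way), using the identity (a + b sqrt(2))(c + d sqrt(2)) = (ac + 2bd) + (ad + bc) sqrt(2):

#include <cstdio>

// Numbers of the form a + b*sqrt(2), closed under + and *.
// No rounding ever occurs; sqrt(2) is represented exactly.
struct QSqrt2 {
    long a, b;  // represents a + b*sqrt(2)
};

QSqrt2 add(QSqrt2 x, QSqrt2 y) { return {x.a + y.a, x.b + y.b}; }

QSqrt2 mul(QSqrt2 x, QSqrt2 y) {
    // (a + b s)(c + d s) = (ac + 2bd) + (ad + bc) s, where s = sqrt(2)
    return {x.a * y.a + 2 * x.b * y.b, x.a * y.b + x.b * y.a};
}

int main() {
    QSqrt2 root2 = {0, 1};          // sqrt(2)
    QSqrt2 sq = mul(root2, root2);  // sqrt(2)^2
    printf("%ld + %ld*sqrt(2)\n", sq.a, sq.b);  // 2 + 0*sqrt(2)
}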
In practice, this is the most complicated thing you will need. If you have a particular necessity, like ranges of floating-point numbers (to prove some result is within a specified interval - this is probably the closest you can get to real numbers), or arbitrary-precision numbers, there are freely available classes everywhere. Look at Boost.Interval for the former, and GMP for the latter.
There are several languages with support for rational and complex numbers. Scheme, for instance, has support built in for arbitrarily precise rational numbers, and complex numbers with either rational, floating point, or integral coefficients:
> (+ 1/2 1/3)
5/6
> (* 3 1+1/2i)
3+3/2i
> (+ 1/2 .5)
1.0
If you want to go beyond rational numbers or complex numbers with rational coefficients, to algebraic numbers such as sqrt(2) or closed-form numbers like e, you will probably have to look beyond general purpose programming languages, and use a special purpose mathematical language like Mathematica or Maxima.
To cover the real numbers with any flair you'll need a symbolic package.
Boost, the C++ project, has a Rational library, but that's only part of the story.
You have irrational numbers in all sorts of forms (pi, the base of the natural logarithm, square and cube roots, the Champernowne constant, to name only a few). The only way I know of to handle arithmetic operations is a symbolic package with smarts about the relationships among all of these numbers. Assuming you could express e^pi, how would you add one to it? Or take the square root of it?
Mathematica might handle these cases.
Java: java.math.BigDecimal
C#: decimal
A lot of languages have support for that: Java has BigDecimal, Perl has Math::BigFloat and Math::BigRat, Haskell has Integer, and many more libraries and languages are listed on Wikipedia.
Ada natively supports fixed-point math as well as floating-point. Fixed-point can be much more exact than floating-point, as long as the values remain in range.
If you need floating point, but with more precision than IEEE gives, there are bignum packages around for just about every language.
I think that's about the best you can do. Neither scheme can exactly represent repeating decimals (like 1/3). It would probably be possible to come up with a scheme that does, but I know of no language that supports such a thing with a built-in type. Even that won't help you with irrational numbers (like pi and e). I believe there's even a theorem that says there will always be unrepresentable numbers, no matter what scheme you come up with.
EDIT: Arbitrary precision decimal datatypes are not what I am looking for. I want to be able to represent numbers like 1/3, sqrt(3), or 1 + 2i as well.
Ruby has a Rational class, so 1/3 can be expressed exactly as Rational(1,3). It also has a Complex class.
Scheme defines rationals, bignums, floating-point and complex numbers. An implementation is not required to support them all, but if they are present, you can mix them and they will do "the right thing".
While it's not "built-in", I think C++ (maybe C#) is your best bet. There are classes out there that have been written for this purpose.
http://www.oonumerics.org/oon/

Help Understanding 8-bit Floating Point Conversions with Decimals and Binary

I'm in a basic Engineering class and we're going through binary conversions. I can figure out the base-10 to binary or hex conversions really well; however, the 8-bit floating point conversions are kicking my ass and I can't find anything online that breaks it down at a n00b level and shows the steps. Wondering if any gurus have found anything online that would be helpful for this situation.
I have questions like: 00101010 (8bfp) = what number in base 10?
Whenever I want to remember how floating point works, I refer back to the wikipedia page on 32 bit floats. I think it lays out the concepts pretty well.
http://en.wikipedia.org/wiki/Single_precision_floating-point_format
Note that Wikipedia doesn't know what 8-bit floats are; I think your professor may have invented them ;)
Binary floating point formats are usually broken down into 3 fields: sign bit, exponent, and mantissa. The sign bit is simply set to 1 if the entire number should be negative, and 0 if the number is positive. The exponent is usually an unsigned int with an offset, where 2 to the 0th power (1) is in the middle of the range. It's simpler in hardware and software to compare sizes this way.

The mantissa works similarly to the mantissa in regular scientific notation, with the following caveat: the most significant bit is hidden. This is due to the requirement of normalizing scientific notation to have one significant digit above the decimal point. Remember when your math teacher in elementary school would whack your knuckles with a ruler for writing 35.648 x 10^6 or 0.35648 x 10^8 instead of the correct 3.5648 x 10^7? Since binary only has two states, the required digit above the decimal point is always one, and eliminating it allows another bit of accuracy at the low end of the mantissa.
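Since 8-bit formats vary by course, here is a C++ sketch that decodes your example assuming a 1-3-4 layout (1 sign bit, 3 exponent bits with a bias of 3, 4 mantissa bits with a hidden leading 1, no denormal handling); check your class notes for the actual widths and bias before trusting it:

#include <cmath>
#include <cstdio>

// Decode an 8-bit float, ASSUMING: 1 sign bit, 3 exponent bits
// (bias 3), 4 mantissa bits with a hidden leading 1.
double decode8(unsigned v) {
    int sign = (v >> 7) & 0x1;
    int exp  = (v >> 4) & 0x7;                 // biased exponent
    int frac =  v       & 0xF;                 // fraction bits
    double mant = 1.0 + frac / 16.0;           // hidden 1 + 4 fraction bits
    double val  = std::ldexp(mant, exp - 3);   // apply the bias
    return sign ? -val : val;
}

int main() {
    // 00101010: sign 0, exponent 010 (2 - 3 = -1), mantissa 1010.
    // Value = 1.1010b * 2^-1 = 1.625 * 0.5 = 0.8125.
    printf("%g\n", decode8(0x2A));
}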