This is what I found: "Cores perform only single-precision floating-point arithmetics. There is 1 double-precision floating-point unit."
Is this true for all compute capabilities (versions)?
Single- and double-precision floating-point accuracy and performance have continuously evolved and differ between compute capabilities.
http://developer.download.nvidia.com/assets/cuda/files/NVIDIA-CUDA-Floating-Point.pdf
http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
section 5.4.1, table 5-1.
In my computer architecture course we use a 14-bit binary model (1 bit for the sign, 5 bits for the exponent, and 8 bits for the mantissa). When entering the exponent, my instructor has us add 16 to offset it (bias 16). Why are we using a bias of 16? Is it because 5 bits can only represent up to 31 numbers? If so, please elaborate and compare to IEEE single precision, which uses a bias of 127 for the exponent. Lastly, if someone can give me a clear definition of bias as used in this context and in binary, I would greatly appreciate it. Please comment if anything I said was unclear.
The IEEE 754 binary float formats follow a simple pattern for the exponent bias. When the exponent has p bits, the bias is $2^{p-1} - 1$. With this bias, the exponent field covers a nearly equal number of positive and negative exponents.
For single-precision floats p is 8 and therefore the bias is 127. For your format p is 5 and the bias would be 15. Maybe your instructor changed the bias to 16 because the format doesn't support denormals, infinity and NaN.
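As a quick check of the pattern above, here is a minimal sketch that computes the IEEE 754 bias and the resulting range of unbiased exponents for normal numbers. The function names are made up for illustration, and the range calculation assumes the IEEE convention that the all-zeros and all-ones exponent fields are reserved.

```python
def ieee_bias(exponent_bits: int) -> int:
    """Bias used by the IEEE 754 binary formats: 2^(p-1) - 1."""
    return 2 ** (exponent_bits - 1) - 1


def normal_exponent_range(exponent_bits: int) -> tuple[int, int]:
    """Unbiased exponent range for normal numbers.

    Assumes the all-zeros and all-ones stored exponents are reserved
    (zero/denormals and infinity/NaN), as in IEEE 754.
    """
    bias = ieee_bias(exponent_bits)
    smallest_stored = 1
    largest_stored = 2 ** exponent_bits - 2
    return smallest_stored - bias, largest_stored - bias


for p in (5, 8, 11):
    print(p, ieee_bias(p), normal_exponent_range(p))
```

For p = 8 this gives a bias of 127 and exponents from -126 to 127; for p = 5 it gives a bias of 15 and exponents from -14 to 15.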
There are several ways of representing a range of numbers including both positive and negative. Adding a bias is particularly flexible. The range [-n, m) can be represented by adding n to each number, mapping it to the range [0, m+n).
That system is used for the exponent in all the floating-point systems I have used. It simplifies some comparisons, because a larger unsigned binary value of the non-sign bits represents a larger absolute magnitude of the float, except for special values such as NaNs.
For float exponents, the bias is around half the exponent range, so that approximately half the values are on each side of zero. Exact balance is impossible because there are an even number of bit patterns, and one is used for zero.
As discussed in another answer, the IEEE 754 standard would use a bias of 15 for a 5 bit exponent.
There are several possible reasons for choosing 16:
There is some actual technical reason, such as the suggested one of not treating 31 as special.
Bias 16 makes the representation of 1.0 particularly simple, with a single non-zero bit.
Being subtly different from IEEE 754 helps convince students that floating point does not imply IEEE 754. There are other floating point formats.
Being subtly different from IEEE 754 may discourage use of existing tools to get the results for exercises without understanding how the representation works.
It is an arbitrary choice of one of the reasonable values for the exponent bias, without reference to IEEE 754.
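To illustrate the point about the representation of 1.0, here is a rough sketch of an encoder for the 14-bit format. It assumes an implicit leading 1 (significand 1.f) and a truncated fraction; neither detail is stated in the question, and a course model that normalizes to 0.1f would give different bit patterns. The name `encode` is hypothetical.

```python
import math

EXP_BITS = 5    # exponent field width from the question
FRAC_BITS = 8   # mantissa field width from the question


def encode(value: float, bias: int) -> str:
    """Return the 14 bits (sign, exponent, fraction) for a positive value.

    Assumption: implicit leading 1, fraction truncated rather than rounded.
    """
    if value <= 0:
        raise ValueError("this sketch handles positive values only")
    mantissa, exponent = math.frexp(value)  # value = mantissa * 2**exponent, 0.5 <= mantissa < 1
    significand = mantissa * 2              # rescale so that 1.0 <= significand < 2
    stored_exponent = (exponent - 1) + bias
    if not 0 <= stored_exponent < 2 ** EXP_BITS:
        raise OverflowError("exponent does not fit in the exponent field")
    fraction = int((significand - 1) * 2 ** FRAC_BITS)
    return "0" + format(stored_exponent, f"0{EXP_BITS}b") + format(fraction, f"0{FRAC_BITS}b")


print(encode(1.0, 16))  # exponent field 10000
print(encode(1.0, 15))  # exponent field 01111
```

Under these assumptions, bias 16 stores 1.0 with a single set bit, while bias 15 needs four.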
According to the Kepler architecture whitepaper, an SMX has 192 CUDA cores and 64 Double Precision Units (DPUs). A K20Xm has 14 SMXs, totalling 2688 cores, which means that only the CUDA cores are counted. What exactly are the DPUs used for, then, and how is their usage related to the cores?
My thoughts:
a) The CUDA cores can't do double precision operations and only the DPUs can. Therefore, the CUDA cores are free for other stuff while the DPUs are busy.
b) The CUDA cores somehow need a double precision unit to do double precision operations, therefore only 128 of the 192 CUDA cores are available for other stuff.
Cheers
Andi
The double-precision units are actually separate hardware floating-point units that do double-precision arithmetic. They are independent of the "CUDA cores", which, roughly speaking, could be considered to be the single-precision units.
So for single-precision arithmetic, the throughput can be computed based on the "CUDA cores" or single-precision units. For double-precision arithmetic, the throughput must be computed based on the double-precision units.
In a Kepler K20 SMX, the ratio of double-precision units to single precision units is 1:3. Therefore the throughput for each type of arithmetic follows the same ratio. By "arithmetic" I mean here floating point multiply or floating point add.
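The arithmetic behind that ratio can be sketched as follows, using the unit counts quoted in the question. The assumption that each unit completes one add or multiply per clock is mine, made only to keep the comparison simple; actual peak figures also depend on clock rate and instruction mix.

```python
SMX_COUNT = 14           # K20Xm, as stated in the question
SP_UNITS_PER_SMX = 192   # "CUDA cores" (single-precision units) per SMX
DP_UNITS_PER_SMX = 64    # double-precision units per SMX


def ops_per_clock(units_per_smx: int, smx_count: int = SMX_COUNT) -> int:
    """Adds or multiplies per clock across the whole GPU.

    Assumption: every unit completes one operation per clock.
    """
    return units_per_smx * smx_count


sp_ops = ops_per_clock(SP_UNITS_PER_SMX)
dp_ops = ops_per_clock(DP_UNITS_PER_SMX)
print(sp_ops, dp_ops, sp_ops // dp_ops)
```

The single-precision count matches the 2688 "cores" in the question, and the double-precision count is one third of it.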
I am writing a program for embedded hardware that only supports 32-bit single-precision floating-point arithmetic. The algorithm I am implementing, however, requires 64-bit double-precision addition and comparison. I am trying to emulate the double datatype using a tuple of two floats. So a double d will be emulated as a struct containing the tuple: (float d.hi, float d.low).
The comparison should be straightforward using a lexicographic ordering. The addition, however, is a bit tricky because I am not sure which base I should use. Should it be FLT_MAX? And how can I detect a carry?
How can this be done?
Edit (Clarity): I need the extra significant digits rather than the extra range.
Double-float is a technique that uses pairs of single-precision numbers to achieve almost twice the precision of single-precision arithmetic, accompanied by a slight reduction of the single-precision exponent range (due to intermediate underflow and overflow at the far ends of the range). The basic algorithms were developed by T.J. Dekker and William Kahan in the 1970s. Below I list two fairly recent papers that show how these techniques can be adapted to GPUs; however, much of the material covered in these papers is applicable independent of platform, so it should be useful for the task at hand.
https://hal.archives-ouvertes.fr/hal-00021443
Guillaume Da Graça, David Defour
Implementation of float-float operators on graphics hardware,
7th conference on Real Numbers and Computers, RNC7.
http://andrewthall.org/papers/df64_qf128.pdf
Andrew Thall
Extended-Precision Floating-Point Numbers for GPU Computation.
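As a rough illustration of the double-float idea, here is a minimal sketch of float-float addition built from the classic error-free transformations (Knuth's TwoSum and Dekker's FastTwoSum). It is not taken from either paper. Python has no native single-precision type, so the helper `f32` (a name invented here) rounds every intermediate result to binary32 via `struct`; on the target hardware each `f32(...)` would simply be an ordinary float operation.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def two_sum(a: float, b: float) -> tuple[float, float]:
    """Knuth's TwoSum: s is the rounded sum, e the rounding error (a + b = s + e)."""
    s = f32(a + b)
    b_virtual = f32(s - a)
    a_virtual = f32(s - b_virtual)
    e = f32(f32(a - a_virtual) + f32(b - b_virtual))
    return s, e


def fast_two_sum(a: float, b: float) -> tuple[float, float]:
    """Dekker's FastTwoSum; only valid when |a| >= |b|."""
    s = f32(a + b)
    e = f32(b - f32(s - a))
    return s, e


def ff_add(x: tuple[float, float], y: tuple[float, float]) -> tuple[float, float]:
    """Add two (hi, lo) pairs and renormalize the result."""
    s, e = two_sum(x[0], y[0])        # exact sum of the high parts
    e = f32(e + f32(x[1] + y[1]))     # fold in the low parts
    return fast_two_sum(s, e)         # renormalize so that hi carries the leading bits


one = (1.0, 0.0)
tiny = (2.0 ** -30, 0.0)
print(f32(1.0 + 2.0 ** -30))  # plain single precision loses the small term
print(ff_add(one, tiny))      # the pair keeps it in the low word
```

There is no explicit base or carry: the low word holds the rounding error of the high word. As long as the pairs are kept normalized like this, comparing them lexicographically (which is what Python's tuple comparison does) should order them by value.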
This is not going to be simple.
A float (IEEE 754 single-precision) has 1 sign bit, 8 exponent bits, and 23 bits of mantissa (well, effectively 24).
A double (IEEE 754 double-precision) has 1 sign bit, 11 exponent bits, and 52 bits of mantissa (effectively 53).
You can use the sign bit and 8 exponent bits from one of your floats, but how are you going to get 3 more exponent bits and 29 bits of mantissa out of the other?
Maybe somebody else can come up with something clever, but my answer is "this is impossible". (Or at least, "no easier than using a 64-bit struct and implementing your own operations")
It depends a bit on what types of operations you want to perform. If you only care about additions and subtractions, Kahan Summation can be a great solution.
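For reference, a minimal sketch of Kahan (compensated) summation next to a naive loop. It uses Python's native doubles for brevity; the same structure applies with single-precision variables on the target, provided the compiler does not optimize the compensation term away.

```python
def naive_sum(values):
    """Plain left-to-right accumulation."""
    total = 0.0
    for x in values:
        total += x
    return total


def kahan_sum(values):
    """Kahan summation: carry the rounding error of each step forward."""
    total = 0.0
    compensation = 0.0                   # running estimate of the lost low-order bits
    for x in values:
        y = x - compensation             # correct the next term by the previous error
        t = total + y                    # low-order bits of y may be lost here
        compensation = (t - total) - y   # recover what was lost
        total = t
    return total


values = [1.0] + [1e-16] * 10
print(naive_sum(values))
print(kahan_sum(values))
```

With this input the naive loop never moves away from 1.0, because each 1e-16 is below half a unit in the last place of 1.0, while the compensated sum ends up close to 1 + 1e-15.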
If you need both the precision and a wide range, you'll be needing a software implementation of double precision floating point, such as SoftFloat.
(For addition, the basic principle is to break the representation (e.g. 64 bits) of each value into its three constituent parts - sign, exponent and mantissa; then shift the mantissa of one part based on the difference in the exponents, add to or subtract from the mantissa of the other part based on the sign bits, and possibly renormalise the result by shifting the mantissa and adjusting the exponent correspondingly. Along the way, there are a lot of fiddly details to account for, in order to avoid unnecessary loss of accuracy, and deal with special values such as infinities, NaNs, and denormalised numbers.)
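The first step described above, splitting a 64-bit double into its three fields, might look like the following sketch. It covers only the decomposition and its inverse, not the alignment, rounding, or special-value handling that a real software floating-point library such as SoftFloat performs; the function names are made up for this example.

```python
import struct

FRACTION_BITS = 52
EXPONENT_MASK = 0x7FF


def decompose(value: float) -> tuple[int, int, int]:
    """Split a binary64 value into (sign, biased exponent, fraction) integers."""
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    sign = bits >> 63
    exponent = (bits >> FRACTION_BITS) & EXPONENT_MASK
    fraction = bits & ((1 << FRACTION_BITS) - 1)
    return sign, exponent, fraction


def compose(sign: int, exponent: int, fraction: int) -> float:
    """Reassemble the three fields into a binary64 value."""
    bits = (sign << 63) | (exponent << FRACTION_BITS) | fraction
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


print(decompose(1.0))
print(decompose(-2.5))
```

On hardware without 64-bit integers the same fields would be spread over two 32-bit words, which is where much of the fiddliness comes from.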
Given all the constraints for high precision over 23 magnitudes, I think the most fruitful method would be to implement a custom arithmetic package.
A quick survey shows Briggs' doubledouble C++ library should address your needs and then some. See this.[*] The default implementation is based on double to achieve 30 significant figure computation, but it is readily rewritten to use float to achieve 13 or 14 significant figures. That may be enough for your requirements if care is taken to segregate addition operations with similar magnitude values, only adding extremes together in the last operations.
Beware though, the comments mention messing around with the x87 control register. I didn't check into the details, but that might make the code too non-portable for your use.
[*] The C++ source is linked by that article, but only the gzipped tar was not a dead link.
This is similar to the double-double arithmetic used by many compilers for long double on some machines that have only hardware double calculation support. It's also used as float-float on older NVIDIA GPUs where there's no double support. See Emulating FP64 with 2 FP32 on a GPU. This way the calculation will be much faster than a software floating-point library.
However, most microcontrollers have no hardware support for floats, so floats are implemented purely in software. Because of that, using float-float may not increase performance, and it introduces some memory overhead for storing the extra exponent bytes.
If you really need the longer mantissa, try using a custom floating-point library. You can choose whatever is enough for you; for example, adapt the library to a new 48-bit float type of your own if only 40 bits of mantissa and 7 bits of exponent are needed. There is then no need to spend time calculating and storing the unnecessary 16 bits. However, this library needs to be very efficient, because the compiler's libraries often have assembly-level optimizations for their own float types.
Another software-based solution that might be of use: GNU MPFR
It allows arbitrary precision (better than a 64-bit double) and takes care of many special cases that you would otherwise have to handle yourself.
That's not practical. If it was, every embedded 32-bit processor (or compiler) would emulate double precision by doing that. As it stands, none do it that I am aware of. Most of them just substitute float for double.
If you need the precision and not the dynamic range, your best bet would be to use fixed point. If the compiler supports 64-bit integers, this will be easier too.
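A minimal sketch of the fixed-point idea, assuming a signed 64-bit integer with 32 fractional bits (a Q31.32 layout chosen here only for illustration). Python integers do not overflow, so the range check stands in for the wraparound a real 64-bit type would have.

```python
FRACTION_BITS = 32
SCALE = 1 << FRACTION_BITS
INT64_MIN, INT64_MAX = -(1 << 63), (1 << 63) - 1


def to_fixed(x: float) -> int:
    """Convert to fixed point by scaling and rounding to the nearest integer."""
    return round(x * SCALE)


def to_float(q: int) -> float:
    """Convert back to floating point for display."""
    return q / SCALE


def fixed_add(a: int, b: int) -> int:
    """Addition is plain integer addition; only overflow needs checking."""
    total = a + b
    if not INT64_MIN <= total <= INT64_MAX:
        raise OverflowError("sum does not fit in a signed 64-bit integer")
    return total


a = to_fixed(1.0)
b = to_fixed(2.0 ** -30)
total = fixed_add(a, b)
print(total, to_float(total))
```

Comparison is ordinary integer comparison, and the resolution is a constant 2^-32 across the whole range, which suits the "precision, not range" requirement.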
I am researching a possible GPU-based teraflop computing machine.
The benchmark to be used will be LINPACK.
Now here's the problem: the LINPACK documentation says that it calculates in full precision rather than double precision, and that for some machines full precision can be single precision. Can someone please shed some light on the difference, as this will dictate whether I should go for the GTX 590s or the Tesla 2070s?
I think the term "full precision" was chosen to cover both IEEE-754 double precision (this is what is used on the GPUs mentioned) and the "single precision" format of old Cray vector computers, which sported 1 sign bit, 15 exponent bits, and 48 mantissa bits, providing a larger range but slightly less precision than IEEE-754 double precision. Here is documentation for the floating-point format used on the Cray-1:
http://ed-thelen.org/comp-hist/CRAY-1-HardRefMan/CRAY-1-HRM.html#p3-20
Concerning NVIDIA's official HPL version 0.8 (that's what we use to benchmark our hybrid machines):
It will run only on Teslas (it works only if your GPU has more than 2 GiB of memory, which, as far as I know, is true only for Tesla)
It uses double precision, so another point for using Teslas, since double arithmetic performance is limited on mainstream GPUs.
BTW: achieving at least 50% efficiency on a 6-node machine (2 GPUs per node) is considered barely possible.
There is a small error between CPU and GPU double-precision results when using a Fermi GPU.
E.g., for a small test set, I get the following absolute error: (Number 1 (CPU) - Number 2 (GPU)) = 3E-018.
In binary form it is, as expected, very small…
NUMBER 1 in binary:
xxxxxxxxxxxxx11100000001001
vs
NUMBER 2 in binary:
xxxxxxxxxxxx111100000001010
Although this is a difference of one binary digit, I am keen to eliminate any differences, as the errors add up over the course of my code.
Any tips from those familiar with Fermi? If this is unavoidable, can I get C/C++ to mimic the Fermi rounding behaviour?
You should take a look at this post.
Floating point is not associative, so if a compiler chooses to do operations in a different order then you'll get a different result. Two versions of the same compiler can produce differences! Different compilers are even more likely to produce differences, and if you're doing work in parallel on the GPU (you are, right?) then you're inherently doing operations in a different order...
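A small example of that non-associativity, using ordinary double-precision values in Python: the same three numbers summed with different grouping give results that differ in the last bit.

```python
left_to_right = (0.1 + 0.2) + 0.3   # the order a sequential loop would use
right_to_left = 0.1 + (0.2 + 0.3)   # an order a parallel reduction might use

print(left_to_right)
print(right_to_left)
print(left_to_right == right_to_left)
```

A parallel reduction on a GPU effectively picks a different grouping from a sequential CPU loop, so last-bit differences of this kind are to be expected even when both sides follow IEEE 754 exactly.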
Fermi hardware is IEEE754-2008 compliant, which means that in addition to IEEE754 standard rounding it also has the fused multiply-add (FMA) instruction which avoids losing precision between multiplication and addition.
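To see how a fused multiply-add can change a result, here is a sketch that emulates the single rounding of an FMA with exact rational arithmetic and compares it against a separately rounded multiply and add. The helper name `fma_exact` and the input values are chosen purely for illustration; this is not how the hardware computes it.

```python
from fractions import Fraction


def fma_exact(a: float, b: float, c: float) -> float:
    """Compute a*b + c exactly, then round once (what a fused multiply-add does)."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def mul_then_add(a: float, b: float, c: float) -> float:
    """Round after the multiply and again after the add."""
    return a * b + c


a = 1.0 + 2.0 ** -27
b = 1.0 - 2.0 ** -27
c = -1.0

print(mul_then_add(a, b, c))  # the product rounds to 1.0, so the small term is lost
print(fma_exact(a, b, c))     # the fused version keeps it
```

If the GPU compiler contracts a multiply and add into an FMA and the CPU code does not (or the other way round), the two results can legitimately differ like this.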