Hope this hasn't already been asked, but... is there any simple way to get high precision floats (something lime 1024 bits precision) working on CUDA without having to code it from scratch? I'd need something very simple, and I need only operator + and *... is this possible?
CUDA Compute 1.3 and above cards will do double precision out of the box without you having to implement it yourself. Basically anything after GTX 280 and Quadro 5800. If you need more precision than that then you will have implement it yourself.
Related
I am writing a program for an embedded hardware that only supports 32-bit single-precision floating-point arithmetic. The algorithm I am implementing, however, requires a 64-bit double-precision addition and comparison. I am trying to emulate double datatype using a tuple of two floats. So a double d will be emulated as a struct containing the tuple: (float d.hi, float d.low).
The comparison should be straightforward using a lexicographic ordering. The addition however is a bit tricky because I am not sure which base should I use. Should it be FLT_MAX? And how can I detect a carry?
How can this be done?
Edit (Clarity): I need the extra significant digits rather than the extra range.
double-float is a technique that uses pairs of single-precision numbers to achieve almost twice the precision of single precision arithmetic accompanied by a slight reduction of the single precision exponent range (due to intermediate underflow and overflow at the far ends of the range). The basic algorithms were developed by T.J. Dekker and William Kahan in the 1970s. Below I list two fairly recent papers that show how these techniques can be adapted to GPUs, however much of the material covered in these papers is applicable independent of platform so should be useful for the task at hand.
https://hal.archives-ouvertes.fr/hal-00021443
Guillaume Da Graça, David Defour
Implementation of float-float operators on graphics hardware,
7th conference on Real Numbers and Computers, RNC7.
http://andrewthall.org/papers/df64_qf128.pdf
Andrew Thall
Extended-Precision Floating-Point Numbers for GPU Computation.
This is not going to be simple.
A float (IEEE 754 single-precision) has 1 sign bit, 8 exponent bits, and 23 bits of mantissa (well, effectively 24).
A double (IEEE 754 double-precision) has 1 sign bit, 11 exponent bits, and 52 bits of mantissa (effectively 53).
You can use the sign bit and 8 exponent bits from one of your floats, but how are you going to get 3 more exponent bits and 29 bits of mantissa out of the other?
Maybe somebody else can come up with something clever, but my answer is "this is impossible". (Or at least, "no easier than using a 64-bit struct and implementing your own operations")
It depends a bit on what types of operations you want to perform. If you only care about additions and subtractions, Kahan Summation can be a great solution.
If you need both the precision and a wide range, you'll be needing a software implementation of double precision floating point, such as SoftFloat.
(For addition, the basic principle is to break the representation (e.g. 64 bits) of each value into its three consitituent parts - sign, exponent and mantissa; then shift the mantissa of one part based on the difference in the exponents, add to or subtract from the mantissa of the other part based on the sign bits, and possibly renormalise the result by shifting the mantissa and adjusting the exponent correspondingly. Along the way, there are a lot of fiddly details to account for, in order to avoid unnecessary loss of accuracy, and deal with special values such as infinities, NaNs, and denormalised numbers.)
Given all the constraints for high precision over 23 magnitudes, I think the most fruitful method would be to implement a custom arithmetic package.
A quick survey shows Briggs' doubledouble C++ library should address your needs and then some. See this.[*] The default implementation is based on double to achieve 30 significant figure computation, but it is readily rewritten to use float to achieve 13 or 14 significant figures. That may be enough for your requirements if care is taken to segregate addition operations with similar magnitude values, only adding extremes together in the last operations.
Beware though, the comments mention messing around with the x87 control register. I didn't check into the details, but that might make the code too non-portable for your use.
[*] The C++ source is linked by that article, but only the gzipped tar was not a dead link.
This is similar to the double-double arithmetic used by many compilers for long double on some machines that have only hardware double calculation support. It's also used as float-float on older NVIDIA GPUs where there's no double support. See Emulating FP64 with 2 FP32 on a GPU. This way the calculation will be much faster than a software floating-point library.
However in most microcontrollers there's no hardware support for floats so they're implemented purely in software. Because of that, using float-float may not increase performance and introduce some memory overhead to save the extra bytes of exponent.
If you really need the longer mantissa, try using a custom floating-point library. You can choose whatever is enough for you, for example change the library to adapt a new 48-bit float type of your own if only 40 bits of mantissa and 7 bits of exponent is needed. No need to spend time for calculating/storing the unnecessary 16 bits anymore. But this library should be very efficient because compiler's libraries often have assembly level optimization for their own type of float.
Another software-based solution that might be of use: GNU MPFR
It takes care of many other special cases and allows arbitrary precision (better than 64-bit double) that you would have to otherwise take care of yourself.
That's not practical. If it was, every embedded 32-bit processor (or compiler) would emulate double precision by doing that. As it stands, none do it that I am aware of. Most of them just substitute float for double.
If you need the precision and not the dynamic range, your best bet would be to use fixed point. IF the compiler supports 64-bit this will be easier too.
Theoretically, you can have 65535 blocks per dimension of the grid, up to 65535 * 65535 * 65535.
If you call a kernel like this:
kernel<<< BLOCKS,THREADS >>>()
(without dim3 objects), what is the maximum number available for BLOCKS?
In an application of mine, I've set it up to 192000 and seemed to work fine... The problem is that the kernel I used changes the contents of a huge array, so although I checked some parts of the array and seemed fine, I can't be sure whether the kernel behaved strangely at other parts.
For the record I have a 2.1 GPU, GTX 500 ti.
With compute capability 3.0 or higher, you can have up to 2^31 - 1 blocks in the x-dimension, and at most 65535 blocks in the y and z dimensions. See Table H.1. Feature Support per Compute Capability of the CUDA C Programming Guide Version 9.1.
As Pavan pointed out, if you do not provide a dim3 for grid configuration, you will only use the x-dimension, hence the per dimension limit applies here.
In case anybody lands here based on a Google search (as I just did):
Nvidia changed the specification since this question was asked. With compute capability 3.0 and newer, the x-Dimension of a grid of thread blocks is allowed to be up to 2'147'483'647 or 2^31 - 1.
See the current: Technical Specification
65535 in a single dimension. Here's the complete table
I manually checked on my laptop (MX130), program crashes when #blocks > 678*1024+651. Each block with 1 thread, Adding even a single more block gives SegFault. Kernal code had no grid, linear structure only.
I am researching for a possible gpu based teraflop computing machine...
the benchmark to be used will be LINPACK
now heres the problem; going through linpack documentation it says that it calculates in full precision and not in double precision ,for some machines full precision can be single precision. Can some one plz throw some light on the difference as this will dictate if I should go for the GTX 590s or the Tesla 2070s.
I think the term "full precision" was chosen to cover both IEEE-754 double precision (this is what is used on the GPUs mentioned) and the "single precision" format of old Cray vector computers, which sported 1 sign bit, 15 exponent bits, and 48 mantissa bits, providing a larger range but slightly less precision than IEEE-754 double precision. Here is documentation for the floating-point format used on the Cray-1:
http://ed-thelen.org/comp-hist/CRAY-1-HardRefMan/CRAY-1-HRM.html#p3-20
Concerning official nVidia's HPL version 0.8 (that's what we use to benchmark our hybrid machines):
It will run only on Teslas (it works only if your GPU has more than 2 GiB of memory, which, as far as I know, is true only for Tesla)
It uses double precision, so another point for using Teslas, since double arithmetic performance is limited on mainstream GPUs.
BTW: achieving at least 50% efficiency on 6-node machine (2 GPUs per node) is considered barely possible.
there is a small error between CPU and GPU double precision results, using a fermi GPU.
e.g. for a small test set, I get the following absolute error for: (Number 1(CPU) - Number 2(GPU)) = 3E-018.
in binary form it is as expected very small…
NUMBER 1 in binary:
xxxxxxxxxxxxx11100000001001
vs
NUMBER 2 in binary:
xxxxxxxxxxxx111100000001010
Although this is a difference of one binary digit, I am keen to eliminate any differences, as the errors addup during my code.
any tips from those familiar with fermi? if this is unavoidable can I get C/C++ to mimic the fermi rounding off behaviour?
You should take a look at this post.
Floating point is not associative, so if a compiler chooses to do operations in a different order then you'll get a different result. Two versions of the same compiler can produce differences! Different compilers are even more likely to produce differences, and if you're doing work in parallel on the GPU (you are, right?) then you're inherently doing operations in a different order...
Fermi hardware is IEEE754-2008 compliant, which means that in addition to IEEE754 standard rounding it also has the fused multiply-add (FMA) instruction which avoids losing precision between multiplication and addition.
say I want to load an array of short from global memory to shared memory. I am not sure how coalescing works here. On best practice guide, it says on device of compute capability 1.0 or 1.1, the k-th thread in a half warp must access the k-th word in a segment aligned to 16 times the size of the elements being accessed.
If I understand it correctly, in case I break my data into 32bytes (16 shorts) segments, thread id 0, 16, 32 ... has to access first element of each segment? do i need to consider 64bytes alignment or 128 bytes alignment as well? I have a gts 250, so i guess this is important. Advices are welcomed. thanks.
According to Section G.3.2.1 of the CUDA Programming Guide short will not coalesce on Compute Capability 1.0 and 1.1 devices under any circumstances. Specifically, it states:
The size of the words accessed by the
threads must be 4, 8, or 16 bytes
You can however use vector types such as short2, short4, or even short8 to get coalesced access. The coalescing rules for these types is spelled out in Section G.3.2.1 as well. However, as far as coalescing is concerned a short2 is just like a 32-bit int.
FWIW, devices with Compute Capability 1.3 or greater handle types like char and short much better. Reading chars on a 1.3 device might give you as much as ~60% of peak memory bandwidth vs. ~10% of peak memory bandwidth on a 1.0 or 1.1 device.