CUDA sincospi function precision - cuda

I have looked all over and couldn't find how the function computes or uses its PI part.
For my project I am using a defined constant for PI with a precision of 34 decimal places. That is much more than the usual math.h constant for PI, which has 16 decimal places.
My question is: how precise a value of PI does sincospi use to compute its answer? Is it just using the PI constant from math.h?

The maximum ulp error for sincospi() is the same as the maximum ulp error for sincos(): the results differ by up to 1 ulp from a correctly rounded reference, or stated differently, the results deviate by less than 1.5 ulps from the infinitely precise mathematical result. On average, more than 90% of results are correctly rounded. The maximum observed error for all CUDA math functions is documented in an appendix of the CUDA C Programming Guide. The accuracy of sincospi(x) is superior to the accuracy achieved by sincos(M_PI*x), as the latter incurs additional rounding error due to the multiplication of the function argument.
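To make the last point concrete, here is a minimal host-side sketch in Java (not CUDA, and not how sincospi() is actually implemented). The class and method names are made up for illustration. It only shows that rounding PI*x before calling sin() leaves a visible error at integer x, while reducing x exactly first and multiplying by PI last does not:

```java
final class SinPiDemo {
    // Naive approach: PI * x is rounded to double before sin() ever sees it.
    static double naiveSinPi(double x) {
        return Math.sin(Math.PI * x);
    }

    // Hypothetical sketch: reduce x exactly first, multiply by PI last.
    static double reducedSinPi(double x) {
        // IEEEremainder is exact; r lies in [-1, 1] and x - r is a multiple of 2.
        double r = Math.IEEEremainder(x, 2.0);
        // Fold into [-0.5, 0.5] using sin(pi*r) = sin(pi*(1 - r)) = sin(pi*(-1 - r)).
        if (r > 0.5) {
            r = 1.0 - r;
        } else if (r < -0.5) {
            r = -1.0 - r;
        }
        return Math.sin(Math.PI * r);
    }

    public static void main(String[] args) {
        System.out.println(naiveSinPi(1.0));   // small but nonzero
        System.out.println(reducedSinPi(1.0)); // 0.0
    }
}
```

Mathematically sin(pi * 1) is exactly zero, so any nonzero output of the naive version is purely the rounding error of the argument.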

Related

Is there a CUDA equivalent of native_recip() in OpenCL?

OpenCL has a built-in function named native_recip:
gentype native_recip(gentype x);
native_recip computes reciprocal over an implementation-defined range. The maximum error is implementation-defined.
The vector versions of the math functions operate component-wise. The description is per-component.
The built-in math functions are not affected by the prevailing rounding mode in the calling environment, and always return the same value as they would if called with the round to nearest even rounding mode.
Is there an equivalent to this function in CUDA?
As noted in comments, it's __frcp_rn() for floats and __drcp_rn() for doubles. For vector types (e.g. float4), the equivalent is an implementation that applies frcp/drcp elementwise.
Note: "rcp" is short for "reciprocal" and "rn" is for the rounding mode "round to nearest even".

CUDA out of memory message after using just ~2.2GB of memory on a GTX1080

I'm doing matrix multiplication on a GTX1080 GPU using JCuda, version 0.8.0RC with CUDA 8.0. I load two matrices A and B into the device in row-major vector form, and read the product matrix from the device. But I'm finding that I run out of device memory earlier than I would expect. For example, if matrix A is dimensioned 100000 * 5000 = 500 million entries = 2GB worth of float values, then:
cuMemAlloc(MatrixA, 100000 * 5000 * Sizeof.FLOAT);
works fine. But if I increase the number of rows from 100000 to 110000, I get the following error on this call (which is made before the memory allocations for matrices B and C, so those are not part of the problem):
Exception in thread "main" jcuda.CudaException: CUDA_ERROR_OUT_OF_MEMORY
at jcuda.driver.JCudaDriver.checkResult(JCudaDriver.java:344)
at jcuda.driver.JCudaDriver.cuMemAlloc(JCudaDriver.java:3714)
at JCudaMatrixMultiply.main(JCudaMatrixMultiply.java:84) (my code)
The issue is that allocating a matrix of this size on the device should take only about 2.2GB, and the GTX1080 has 8GB of memory, so I don't see why I'm running out of memory. Does anyone have any thoughts on this? It's true that I'm using JCuda 0.8.0RC with the release version of CUDA 8, but I tried downloading the RC version of CUDA 8 (8.0.27) to use with JCuda 0.8.0RC and had some problems getting it to work. If version compatibility is likely to be the issue, however, I can try again.
Matrices of 100000 * 5000 are pretty big, of course, and I won't need to work with larger matrices for a while on my neural network project, but I would like to be confident that I can use all 8GB of memory on this new card. Thanks for any help.
tl;dr:
When calling
cuMemAlloc(MatrixA, (long)110000 * 5000 * Sizeof.FLOAT);
// the (long) cast on the first operand is what matters here
or alternatively
cuMemAlloc(MatrixA, 110000L * 5000 * Sizeof.FLOAT);
// the "L" suffix makes the first literal a long
it should work.
The last argument to cuMemAlloc is of type size_t. This is an implementation-specific unsigned integer type for "arbitrary" sizes. The closest possible primitive type in Java for this is long. And in general, every size_t in CUDA is mapped to long in JCuda. In this case, the Java long is passed as a jlong into the JNI layer, and this is simply cast to size_t for the actual native call.
(The lack of unsigned types in Java and the odd plethora of integer types in C can still cause problems. Sometimes, the C types and the Java types just don't match. But as long as the allocation is not larger than 9 Million Terabytes (!), a long should be fine here...)
But the comment by havogt led to the right track. What happens here is indeed an integer overflow: the computation of the actual value
110000 * 5000 * Sizeof.FLOAT = 2200000000
is by default done using the int type in Java, and this is where the overflow happens: 2200000000 is larger than Integer.MAX_VALUE. The result will be a negative value. When this is cast to the (unsigned) size_t value in the JNI layer, it becomes a ridiculously large positive value, which clearly causes the error.
When doing the computation using long values, either by explicitly casting to long or by appending the L suffix to one of the literals, the value is passed to CUDA as the proper long value of 2200000000.
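The overflow can be reproduced without a GPU. The following is a small sketch, with SIZEOF_FLOAT as a stand-in for Sizeof.FLOAT so that it runs without JCuda, and with made-up class and method names. Printing the wrapped value as an unsigned 64-bit number assumes a 64-bit size_t on the native side:

```java
final class AllocSizeDemo {
    // Stand-in for jcuda.Sizeof.FLOAT so the sketch runs without JCuda.
    static final int SIZEOF_FLOAT = 4;

    static long sizeWithIntMath(int rows, int cols) {
        // All three operands are int, so the product wraps before it is widened to long.
        return rows * cols * SIZEOF_FLOAT;
    }

    static long sizeWithLongMath(int rows, int cols) {
        // Casting the first operand promotes the whole product to long arithmetic.
        return (long) rows * cols * SIZEOF_FLOAT;
    }

    public static void main(String[] args) {
        long wrapped = sizeWithIntMath(110000, 5000);
        System.out.println(wrapped);                        // -2094967296
        // What the negative value looks like when reinterpreted as unsigned 64-bit:
        System.out.println(Long.toUnsignedString(wrapped)); // 18446744071614584320
        System.out.println(sizeWithLongMath(110000, 5000)); // 2200000000
    }
}
```

With 100000 rows the int product is 2000000000, which still fits below Integer.MAX_VALUE, which is why the smaller allocation happened to work.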

CUDA, float precision

I am using CUDA 4.0 on a GeForce GTX 580 (Fermi). I have numbers as small as 7.721155e-43. I want to multiply them with each other just once, or to put it better, I want to calculate 7.721155e-43 * 7.721155e-43.
My experience shows that I can't do it in a straightforward way. Could you please give me a suggestion? Do I need to use double precision? How?
The magnitude of the smallest normal IEEE single-precision number is about 1.18e-38, and the smallest denormal gets you down to about 1.40e-45. As a consequence, an operand of magnitude 7.72e-43 will comprise only about 10 significant bits, which in itself may already be a problem, even before you get to the multiplication (whose result will underflow to zero in single precision). So you may also want to look at any upstream computation that produces these tiny numbers.
If these small numbers are intermediate terms in a mathematical expression, rewriting that expression into a mathematically equivalent one that does not involve tiny intermediates would be one way of addressing the issue. Or you could scale some operands by factors that are powers of two (so as to not incur additional round-off due to the scaling). For example, scale by 2^24 = 16777216.
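As a rough illustration of the power-of-two scaling idea, here is a host-side Java sketch. The class name, the method name, and the exponent 80 are all assumptions chosen so that this particular product lands in the normal float range. The right factor depends on the magnitudes in your own computation, and the scale has to be tracked and undone later:

```java
final class ScalingDemo {
    // Assumed scale exponent: each operand is multiplied by 2^80.
    static final int SCALE_EXP = 80;

    // Returns a * b * 2^(2 * SCALE_EXP), computed entirely in float.
    static float scaledProduct(float a, float b) {
        // Math.scalb multiplies by a power of two, so no extra round-off is introduced
        // as long as the scaled operands stay within float range.
        float as = Math.scalb(a, SCALE_EXP);
        float bs = Math.scalb(b, SCALE_EXP);
        return as * bs;
    }

    public static void main(String[] args) {
        float f = 7.721155e-43f;
        System.out.println(f * f);               // 0.0, underflow in single precision
        System.out.println(scaledProduct(f, f)); // nonzero, carries a factor of 2^160
    }
}
```

The caller is responsible for remembering that the result is too large by 2^160 and compensating for it in a later step where the value is back in a safe range.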
Lastly, you can switch part of the computation to double precision. To do so, simply introduce temporary variables of type double, perform the computation on them, then convert the final result back to float:
float r, f = 7.721155e-43f;
double d, t;
d = (double)f; // explicit cast is not necessary, since converting to wider type
t = d * d;
[... more intermediate computation, leaving result in 't' ...]
r = (float)t; // since conversion is to narrower type, cast will avoid warnings
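The same effect can be checked on the host. Below is a small Java sketch (class and method names are made up for illustration) that squares the value from the question once in float and once through a double temporary:

```java
final class DoubleIntermediateDemo {
    // float * float is evaluated in single precision and underflows for tiny inputs.
    static float squareInFloat(float f) {
        return f * f;
    }

    // Widening to double first keeps the tiny product representable.
    static double squareInDouble(float f) {
        double d = f; // implicit widening conversion, exact
        return d * d;
    }

    public static void main(String[] args) {
        float f = 7.721155e-43f;
        System.out.println(squareInFloat(f));  // 0.0
        System.out.println(squareInDouble(f)); // roughly 5.96e-85
    }
}
```

Note that a value of this size would underflow to zero again if it were converted straight back to float, so the double temporaries only help if the later intermediate steps bring the result back into float range.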
In statistics we often have to work with likelihoods that end up being very small numbers and the standard technique is to use logs for everything. Then multiplication on a log scale is just addition. All intermediate numbers are stored as logs. Indeed it can take a bit of getting used to - but the alternative will often fail even when doing relatively modest computations. In R (for my convenience!) which uses doubles and prints 7 significant figures by default btw:
> 7.721155e-43 * 7.721155e-43
[1] 5.961623e-85
> exp(log(7.721155e-43) + log(7.721155e-43))
[1] 5.961623e-85
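For a case where even doubles give up, here is a short Java sketch of the same log trick (names and the sample values are made up for illustration). Multiplying 1000 likelihoods of 1e-5 directly underflows to zero, while the sum of logs stays a perfectly ordinary number:

```java
final class LogDomainDemo {
    // Multiplies p by itself n times directly; underflows once the product gets too small.
    static double directProduct(double p, int n) {
        double prod = 1.0;
        for (int i = 0; i < n; i++) {
            prod *= p;
        }
        return prod;
    }

    // Same product on a log scale: multiplication becomes addition.
    static double logProduct(double p, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += Math.log(p);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(directProduct(1e-5, 1000)); // 0.0, underflow
        System.out.println(logProduct(1e-5, 1000));    // about -11512.9
        // For two factors both routes agree closely:
        System.out.println(Math.exp(logProduct(7.721155e-43, 2)));
    }
}
```

The result is only turned back with exp() at the very end, and only if it is known to be representable.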

Difference between double precision and full precision floating

I am researching a possible GPU-based teraflop computing machine.
The benchmark to be used will be LINPACK.
Now here's the problem: the LINPACK documentation says that it calculates in full precision and not in double precision, and that for some machines full precision can be single precision. Can someone please shed some light on the difference, as this will dictate whether I should go for the GTX 590s or the Tesla 2070s?
I think the term "full precision" was chosen to cover both IEEE-754 double precision (this is what is used on the GPUs mentioned) and the "single precision" format of old Cray vector computers, which sported 1 sign bit, 15 exponent bits, and 48 mantissa bits, providing a larger range but slightly less precision than IEEE-754 double precision. Here is documentation for the floating-point format used on the Cray-1:
http://ed-thelen.org/comp-hist/CRAY-1-HardRefMan/CRAY-1-HRM.html#p3-20
Concerning NVIDIA's official HPL version 0.8 (that's what we use to benchmark our hybrid machines):
It will run only on Teslas (it works only if your GPU has more than 2 GiB of memory, which, as far as I know, is true only for Tesla).
It uses double precision, which is another point in favor of Teslas, since double-precision arithmetic performance is limited on mainstream GPUs.
BTW: achieving at least 50% efficiency on a 6-node machine (2 GPUs per node) is considered barely possible.

fermi cuda double precision against C

There is a small error between CPU and GPU double precision results, using a Fermi GPU.
E.g. for a small test set, I get the following absolute error: (Number 1 (CPU) - Number 2 (GPU)) = 3E-018.
In binary form it is, as expected, very small:
NUMBER 1 in binary:
xxxxxxxxxxxxx11100000001001
vs
NUMBER 2 in binary:
xxxxxxxxxxxx111100000001010
Although this is a difference of one binary digit, I am keen to eliminate any differences, as the errors add up during my code.
Any tips from those familiar with Fermi? If this is unavoidable, can I get C/C++ to mimic the Fermi rounding behaviour?
You should take a look at this post.
Floating point is not associative, so if a compiler chooses to do operations in a different order then you'll get a different result. Two versions of the same compiler can produce differences! Different compilers are even more likely to produce differences, and if you're doing work in parallel on the GPU (you are, right?) then you're inherently doing operations in a different order...
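A tiny host-side Java sketch of the non-associativity (names are made up for illustration). The same three doubles are summed in two different orders and give two different answers:

```java
final class AssociativityDemo {
    static double leftToRight(double a, double b, double c) {
        return (a + b) + c;
    }

    static double rightToLeft(double a, double b, double c) {
        return a + (b + c);
    }

    public static void main(String[] args) {
        double a = 1e16, b = -1e16, c = 1.0;
        System.out.println(leftToRight(a, b, c)); // 1.0
        // b + c rounds back to -1e16 because doubles near 1e16 are spaced 2 apart.
        System.out.println(rightToLeft(a, b, c)); // 0.0
    }
}
```

A parallel reduction on the GPU sums in a tree-like order rather than strictly left to right, so small differences of this kind against a sequential CPU loop are to be expected.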
Fermi hardware is IEEE754-2008 compliant, which means that in addition to IEEE754 standard rounding it also has the fused multiply-add (FMA) instruction which avoids losing precision between multiplication and addition.
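To see what the single rounding of an FMA changes, here is a host-side Java sketch (Math.fma needs Java 9 or later; the class and method names and the operands are chosen for illustration). The operands are picked so that the exact product 1 - 2^-54 rounds to 1.0 when multiplication and addition are rounded separately:

```java
final class FmaDemo {
    // Two roundings: one after the multiply, one after the add.
    static double separate(double a, double b, double c) {
        return a * b + c;
    }

    // One rounding: the exact product a*b is added to c before rounding.
    static double fused(double a, double b, double c) {
        return Math.fma(a, b, c);
    }

    public static void main(String[] args) {
        double a = 1.0 + Math.scalb(1.0, -27); // 1 + 2^-27, exact
        double b = 1.0 - Math.scalb(1.0, -27); // 1 - 2^-27, exact
        System.out.println(separate(a, b, -1.0)); // 0.0
        System.out.println(fused(a, b, -1.0));    // -2^-54, about -5.55e-17
    }
}
```

If the GPU compiler merges a multiply and an add into one FMA while the CPU build performs them separately, last-bit differences like the one in the question are a plausible outcome. Both results are valid IEEE 754 arithmetic; the fused one is simply closer to the exact value.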