Is there a CUDA equivalent of native_recip() in OpenCL?

OpenCL has a built-in function named native_recip:
gentype native_recip(gentype x);
native_recip computes reciprocal over an implementation-defined range. The maximum error is implementation-defined.
The vector versions of the math functions operate component-wise. The description is per-component.
The built-in math functions are not affected by the prevailing rounding mode in the calling environment, and always return the same value as they would if called with the round to nearest even rounding mode.
Is there an equivalent to this function in CUDA?

As noted in the comments, the equivalents are __frcp_rn() for floats and __drcp_rn() for doubles; for vector types (e.g. float4) you write a small wrapper that applies __frcp_rn()/__drcp_rn() element-wise, as sketched below.
Note: "rcp" is short for "reciprocal" and "rn" is for the rounding mode "round to nearest even".

Related

CUDA tridiagonal solver function (cusparse)

In my CUDA code I am using the cusparse<t>gtsv() function (more precisely, the cusparseZgtsv and cusparseZgtsvStridedBatch variants).
The documentation says that this function solves the equation A*x = alpha*B. My question is: what is alpha? I did not find it among the input parameters, and I have no idea how to specify it. Is it always equal to 1?
I performed some testing (I solved some random systems of equations whose tridiagonal matrices were always diagonally dominant and checked my solutions by direct matrix-vector multiplication).
It looks like alpha = 1 always in the current version, so one can simply ignore it. I suspect it will be added as an input parameter in a future release.
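For reference, a hedged sketch of the call shape (the wrapper name solve_tridiag is illustrative; note that no alpha appears anywhere in the argument list, consistent with alpha = 1):

#include <cusparse_v2.h>

// Solve A*x = B for tridiagonal A. dl, d, du are device arrays of
// length m holding the sub-, main and super-diagonals; B is an
// m-by-n right-hand-side matrix on the device and is overwritten
// with the solution X.
cusparseStatus_t solve_tridiag(cusparseHandle_t handle, int m, int n,
                               cuDoubleComplex *dl, cuDoubleComplex *d,
                               cuDoubleComplex *du, cuDoubleComplex *B,
                               int ldb)
{
    return cusparseZgtsv(handle, m, n, dl, d, du, B, ldb);
}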

How to compute the maximum of a function?

How do you compute the maximum of a smooth function defined on [a,b] in Fortran?
For simplicity, assume a polynomial function.
The background is that almost every numerical flux (a concept in numerical PDEs) involves computing the maximum of a certain function over an interval [a,b].
For a 1-D problem with smooth and readily computed derivatives, use Newton-Raphson to find zeros of the first derivative, as sketched below; then compare the resulting critical points against the endpoint values f(a) and f(b), since the maximum over a closed interval may be attained at an endpoint.
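A minimal sketch of the Newton iteration on f' (in C, to stay consistent with the rest of this thread; fp and fpp stand in for the user's first and second derivatives, and the example polynomial is arbitrary):

#include <math.h>
#include <stdio.h>

/* Example objective f(x) = -(x - 1)^2 + 3; its interior maximum is at
   the zero of f'(x) = -2(x - 1). */
static double fp(double x)  { return -2.0 * (x - 1.0); }  /* f'(x)  */
static double fpp(double x) { (void)x; return -2.0; }     /* f''(x) */

int main(void)
{
    double x = 0.0;  /* initial guess inside [a,b] */
    for (int i = 0; i < 50; ++i) {
        double step = fp(x) / fpp(x);  /* Newton step applied to f' */
        x -= step;
        if (fabs(step) < 1e-12) break;
    }
    printf("critical point near x = %g\n", x);  /* expect 1.0 */
    return 0;
}

Check the sign of f'' at the result to confirm the critical point is a maximum rather than a minimum.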
For multiple dimensions with readily computed derivatives, you're better off using a method that approximates the Hessian. There are several methods of this type, but I've found L-BFGS to be reliable and efficient. There is a convenient, BSD-licensed package provided by a group at Northwestern University. There's also quite a bit of well-tested code at http://www.netlib.org/

CUDA Texture Linear Filtering

In the CUDA C Programming Guide Version 5, Appendix E.2 (Linear Filtering), it is stated that:
In this filtering mode, which is only available for floating-point textures, the value returned by the texture fetch is...
The emphasized part ("only available for floating-point textures") is confusing me. Does floating-point refer to the texel type only, or also to the return type? For example, I declare 3 textures as follows.
texture<float,cudaTextureType2D> tex32f;
texture<unsigned char, cudaTextureType2D, cudaReadModeNormalizedFloat> tex8u;
texture<unsigned short, cudaTextureType2D, cudaReadModeNormalizedFloat> tex16u;
Is linear filtering available for tex32f only, or also for tex8u and tex16u?
It means that linear filtering is available whenever the value delivered to the kernel is floating-point: either the texel type is itself float, or the "read mode" of the texture is cudaReadModeNormalizedFloat, in which case integer texel types (such as u8) get promoted to floating-point values in the range [0.0, 1.0] (for unsigned integers) or [-1.0, 1.0] (for signed integers). So linear filtering is available for all three of your textures.
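A minimal sketch based on the question's own declarations, showing where the filter mode is set (the names fetch and enable_filtering are illustrative):

texture<unsigned char, cudaTextureType2D, cudaReadModeNormalizedFloat> tex8u;

__global__ void fetch(float *out, float x, float y)
{
    // The fetch returns a float in [0.0, 1.0] because of the read mode,
    // which is what makes linear filtering legal for this u8 texture.
    *out = tex2D(tex8u, x, y);
}

// Host side: enable interpolation before binding and launching.
void enable_filtering(void)
{
    tex8u.filterMode = cudaFilterModeLinear;
    tex8u.addressMode[0] = cudaAddressModeClamp;
    tex8u.addressMode[1] = cudaAddressModeClamp;
}

With cudaReadModeElementType instead, setting cudaFilterModeLinear on an integer texture would be invalid, since the fetch would return integer data.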

CUDA sincospi function precision

I have looked all over and I couldn't find how this function computes or uses its PI part.
For my project I am using a defined constant for PI with 34 decimal places of precision. However, that is far more than the usual math.h constant for PI, which has about 16 decimal places.
My question is: how precise a value of PI does sincospi() use to compute its answer? Is it just using the PI constant from math.h?
The maximum ulp error for sincospi() is the same as the maximum ulp error for sincos(): the results differ by up to 1 ulp from a correctly rounded reference, or, stated differently, they deviate by less than 1.5 ulps from the infinitely precise mathematical result. On average, more than 90% of results are correctly rounded. The maximum observed error for all CUDA math functions is documented in an appendix of the CUDA C Programming Guide. The accuracy of sincospi(x) is superior to the accuracy achieved by sincos(M_PI*x): the latter incurs additional rounding error from the multiplication of the function argument, since M_PI is itself only pi rounded to double precision. In other words, sincospi() behaves as if the multiplication by pi were carried out to infinite precision; it does not simply multiply by a stored PI constant.
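A small self-contained comparison, under the assumption that the code is compiled with nvcc and that M_PI is available from math.h (it is a POSIX macro, not standard C):

#include <math.h>
#include <stdio.h>

__global__ void compare(double x, double *out)
{
    double s1, c1, s2, c2;
    sincospi(x, &s1, &c1);       // no explicit multiplication by pi
    sincos(M_PI * x, &s2, &c2);  // M_PI*x is rounded before the call
    out[0] = s1; out[1] = c1;
    out[2] = s2; out[3] = c2;
}

int main(void)
{
    double *d_out, h_out[4];
    cudaMalloc(&d_out, 4 * sizeof(double));
    compare<<<1, 1>>>(0.5, d_out);  // sin(pi/2) = 1, cos(pi/2) = 0
    cudaMemcpy(h_out, d_out, 4 * sizeof(double), cudaMemcpyDeviceToHost);
    printf("sincospi:       sin=%.17g cos=%.17g\n", h_out[0], h_out[1]);
    printf("sincos(M_PI*x): sin=%.17g cos=%.17g\n", h_out[2], h_out[3]);
    cudaFree(d_out);
    return 0;
}

At x = 0.5, sincospi() returns cos = 0 exactly, while sincos(M_PI*0.5) returns a cosine of about 6.1e-17, because M_PI*0.5 is only approximately pi/2.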

fermi cuda double precision against C

There is a small error between CPU and GPU double-precision results when using a Fermi GPU.
For example, for a small test set I get the following absolute error: (Number 1 (CPU) - Number 2 (GPU)) = 3E-18.
In binary form it is, as expected, very small…
NUMBER 1 in binary:
xxxxxxxxxxxxx11100000001001
vs
NUMBER 2 in binary:
xxxxxxxxxxxx111100000001010
Although this is a difference of only one binary digit, I am keen to eliminate any differences, as the errors add up during my code.
Any tips from those familiar with Fermi? If this is unavoidable, can I get C/C++ to mimic the Fermi rounding behaviour?
You should take a look at this post.
Floating point is not associative, so if a compiler chooses to do operations in a different order then you'll get a different result. Two versions of the same compiler can produce differences! Different compilers are even more likely to produce differences, and if you're doing work in parallel on the GPU (you are, right?) then you're inherently doing operations in a different order...
Fermi hardware is IEEE 754-2008 compliant, which means that in addition to IEEE 754 standard rounding it also has the fused multiply-add (FMA) instruction, which avoids losing precision between a multiplication and an addition by rounding only once instead of twice. This is a very common source of last-bit differences between CPU and GPU results, as illustrated below.
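A host-side illustration of the effect: the C99 fma() function from math.h rounds once, like the GPU's FMA instruction, while a*b + c rounds twice. To make the two sides match, either use fma() on the CPU in the same places, or pass --fmad=false to nvcc so that the GPU also rounds twice; depending on your host compiler's defaults you may also need -ffp-contract=off to stop it from contracting the expression itself. The operand values below are arbitrary, chosen so the two results visibly differ:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double a = 1.0 + pow(2.0, -27);
    double b = 1.0 - pow(2.0, -27);
    /* a*b = 1 - 2^-54 exactly; rounding that product to double gives
       1.0, so the separately rounded expression yields 0. */
    double separate = a * b + (-1.0);
    /* fma computes a*b - 1 with a single rounding: -2^-54. */
    double fused = fma(a, b, -1.0);
    printf("separate: %.17g\n", separate);
    printf("fused:    %.17g\n", fused);
    return 0;
}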