Special CUDA Double Precision trig functions for SFU

I was wondering how I would go about using __cos(x) (and, respectively, __sin(x)) in kernel code with CUDA. I looked it up in the CUDA manual and there is such a device function listed, yet when I use it the compiler just says that I cannot call a host function from the device.
However, I found that there are two sister functions, cosf(x) and __cosf(x), the latter of which runs on the SFU and is overall much faster than the original cosf(x) function. The compiler does not complain about the __cosf(x) function, of course.
Is there a library I'm missing? Am I mistaken about this trig function?

As the SFU only supports certain single-precision operations, there are no double-precision __cos() and __sin() device functions. There are single-precision __cosf() and __sinf() device functions, as well as other functions detailed in table C-4 of the CUDA 4.2 Programming Manual.
I assume you are looking for faster alternatives to the double-precision versions of the standard math functions sin() and cos()? If sine and cosine of the same argument are needed, sincos() should be used for a significant performance boost. If the argument of sine or cosine is multiplied by π, you would want to use sinpi(), cospi(), or sincospi() instead, for even more performance. For example, sincospi() is very useful when implementing the Box-Muller algorithm for generating normally distributed random numbers. Also, check out the CUDA 5.0 preview for best possible performance (note that the preview provides alpha-release quality).
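As an illustrative sketch only (the helper name box_muller and the assumption that u1 and u2 are uniform random numbers in (0, 1] are mine, not part of the original answer), a Box-Muller step built on double-precision sincospi() might look like:

// Sketch: Box-Muller transform using sincospi() to get sin(2*pi*u2) and
// cos(2*pi*u2) in a single call.
__device__ void box_muller(double u1, double u2, double *z0, double *z1)
{
    double r = sqrt(-2.0 * log(u1));
    double s, c;
    sincospi(2.0 * u2, &s, &c);
    *z0 = r * c;
    *z1 = r * s;
}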


Do CUDA cores have vector instructions?

According to most NVIDIA documentation, CUDA cores are scalar processors and should only execute scalar operations, which get vectorized into 32-wide SIMT warps.
But OpenCL has vector types, for example uchar8. It has the same size as ulong (64 bit), which can be processed by a single scalar core. If I do operations on a uchar8 vector (for example, component-wise addition), will this also map to an instruction on a single core?
If there are 1024 work items in a block (work group), and each work item processes a uchar8, will this effectively process 8192 uchars in parallel?
Edit:
My question was whether, on CUDA architectures specifically (independently of OpenCL), there are some vector instructions available in the "scalar" cores. Because if a core is already capable of handling a 32-bit type, it would be reasonable to expect that it could also handle addition of a 32-bit uchar4, for example, especially since vector operations are often used in computer graphics.
CUDA has "built-in" (i.e. predefined) vector types up to a size of 4 for 4-byte quantities (e.g. int4) and up to a size of 2 for 8-byte quantities (e.g. double2). A CUDA thread has a maximum read/write transaction size of 16 bytes, so these particular size choices tend to line up with that maximum.
These are exposed as typical structures, so you can reference for example .x to access just the first element of a vector type.
Unlike OpenCL, CUDA does not provide built-in operations ("overloads") for basic arithmetic e.g. +, -, etc. for element-wise operations on these vector types. There's no particular reason you couldn't provide such overloads yourself. Likewise, if you wanted a uchar8 you could easily provide a structure definition for such, as well as any desired operator overloads. These could probably be implemented just as you would expect for ordinary C++ code.
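For illustration, a user-supplied overload for one of the built-in vector types might look like the following minimal sketch (CUDA itself does not ship this; the choice of int4 and of operator+ is just an example):

// Sketch: a user-provided elementwise '+' for CUDA's built-in int4 type.
// Plain C++ semantics; nothing here implies a single hardware instruction.
__host__ __device__ inline int4 operator+(int4 a, int4 b)
{
    return make_int4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}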
Probably an underlying question is, then, what is the difference in implementation between CUDA and OpenCL in this regard? If I operate on a uchar8, e.g.
uchar8 v1 = {...};
uchar8 v2 = {...};
uchar8 r = v1 + v2;
what will the difference be in terms of machine performance (or low-level code generation) between OpenCL and CUDA?
Probably not much, for a CUDA-capable GPU. A CUDA core (i.e. the underlying ALU) does not have direct native support for such an operation on a uchar8, and furthermore, if you write your own C++ compliant overload, you're probably going to use C++ semantics for this which will inherently be serial:
r.x = v1.x + v2.x;
r.y = v1.y + v2.y;
...
So this will decompose into a sequence of operations performed on the CUDA core (or in the appropriate integer unit within the CUDA SM). Since the NVIDIA GPU hardware doesn't provide any direct support for an 8-way uchar add within a single core/clock/instruction, there's really no way OpenCL (as implemented on an NVIDIA GPU) could be much different. At a low level, the underlying machine code is going to be a sequence of operations, not a single instruction.
As an aside, CUDA (or PTX, or CUDA intrinsics) does provide for a limited amount of vector operations within a single core/thread/instruction. Some examples of this are:
a limited set of "native" "video" SIMD instructions. These instructions are per-thread, so if used, they allow for "native" support of up to 4x32 = 128 (8-bit) operands per warp, although the operands must be properly packed into 32-bit registers. You can access these from C++ directly via a set of built-in intrinsics. (A CUDA warp is a set of 32 threads, and is the fundamental unit of lockstep parallel execution and scheduling on a CUDA capable GPU.)
a vector (SIMD) multiply-accumulate operation, the so-called int8 dp2a and dp4a instructions, which do not translate directly to a single elementwise operator overload. "int8" here is somewhat misleading: it does not refer to an int8 vector type but rather to a packed arrangement of four 8-bit integer quantities in a single 32-bit word/register. Again, these are accessible via intrinsics (a sketch follows this list).
16-bit floating point is natively supported via half2 vector type in cc 5.3 and higher GPUs, for certain operations.
The new Volta tensorCore is something vaguely like a SIMD-per-thread operation, but it operates (warp-wide) on a set of 16x16 input matrices producing a 16x16 matrix result.
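To illustrate the dp4a item above (the kernel name and data layout are my own assumptions; this requires a cc 6.1 or newer device), a per-thread dot product over packed 8-bit integers might look like:

// Sketch: __dp4a computes a dot product of the four packed 8-bit integers in
// each operand and adds it to an accumulator, in a single instruction (cc 6.1+).
__global__ void packed_dot(const int *a, const int *b, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = __dp4a(a[i], b[i], 0);  // accumulator starts at 0
    }
}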
Even with a smart OpenCL compiler that could map certain vector operations into the various operations "natively" supported by the hardware, it would not be complete coverage. There is no operational support for an 8-wide vector (e.g. uchar8) on a single core/thread, in a single instruction, to pick one example. So some serialization would be necessary. In practice, I don't think the OpenCL compiler from NVIDIA is that smart, so my expectation is that you would find such per-thread vector operations fully serialized, if you studied the machine code.
In CUDA, you could provide your own overload for certain operations and vector types, that could be represented approximately in a single instruction. For example a uchar4 add could be performed "natively" with the __vadd4() intrinsic (perhaps included in your implementation of an operator overload.) Likewise, if you are writing your own operator overload, I don't think it would be difficult to perform a uchar8 elementwise vector add using two __vadd4() instructions.
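A minimal sketch of that two-__vadd4 approach might look like this (the uchar8 layout, with the eight bytes packed into two 32-bit words, is my own assumption, not a CUDA-provided type):

// Hypothetical uchar8: eight unsigned bytes packed into two 32-bit registers.
struct uchar8 {
    unsigned int lo;  // bytes 0..3
    unsigned int hi;  // bytes 4..7
};

// Elementwise add using two __vadd4() per-byte SIMD intrinsics.
__device__ __forceinline__ uchar8 operator+(uchar8 a, uchar8 b)
{
    uchar8 r;
    r.lo = __vadd4(a.lo, b.lo);  // four byte-wise adds in one instruction
    r.hi = __vadd4(a.hi, b.hi);  // four more byte-wise adds
    return r;
}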
If I do operations on a uchar8 vector (for example component-wise addition), will this also map to an instruction on a single core?
AFAIK it'll always be on a single core (instructions from a single kernel / work item don't cross cores, except for special instructions like barriers), but it may be more than one instruction. This depends on whether your hardware supports operations on uchar8 natively. If it does not, then the uchar8 will be broken up into as many pieces as required, and each piece will be processed with a separate instruction.
OpenCL is very "generic" in the sense that it supports many different vector type/size combos, but real-world hardware usually only implements some vector type/size combinations. You can query OpenCL devices for "preferred vector size" which should tell you what's the most efficient for that hardware.

CUDA, cuBLAS and half precision data types

Since CUDA 7.5/8.0 and devices with Pascal GPUs, CUDA supports the half-precision (FP16) datatype out of the box. Additionally, many of the BLAS calls inside cuBLAS support half-precision types, e.g. the GEMM operation available as cublasHgemm. My problem is that the host does not support half-precision types. Is there an already-implemented solution like cublasSetMatrix which does the conversion during the upload to the device? Or is it necessary to create a tricky implementation by composing a float upload with a CUDA kernel doing the truncation to half?
There is no function currently provided by the CUDA toolkit which converts float quantities to half quantities in the process of copying data from host to device.
It is possible to convert from float to half either in host code or device code. There would be advantages and disadvantages of doing it in either place.
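As a rough sketch of the device-code route (the kernel name and indexing scheme are illustrative assumptions, not a library facility), the conversion can be done with the __float2half() intrinsic from cuda_fp16.h:

#include <cuda_fp16.h>

// Sketch: convert an array of floats already resident on the device to half.
__global__ void float_to_half(const float *in, __half *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = __float2half(in[i]);  // 32-bit float -> 16-bit half
    }
}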
Furthermore, there is a cublas<t>gemmEx function available that may be of interest, which can have differing datatypes for input and output (and computation).
Some other half resources that may be of interest:
half conversion library by Christian Rau, mentioned by talonmies below
CUDA half intrinsics.
previous SO question, with links to some other resources

Does CUDA signbit() remove divergence?

I have seen that some people suggest that using signbit() can eliminate warp divergence and improve performance. If this is correct, then how is it implemented in the GPU? Is there some dedicated hardware for this function in, e.g., special function units (SFU)?
The implementation of signbit() is open to inspection in CUDA versions up to, and including, CUDA 6.5; it can be found in the header file math_functions.h. For newer versions of CUDA, you can inspect the machine code with cuobjdump --dump-sass to see how it is implemented.
Looking at the header file in CUDA 6.5, one sees that signbit() is a macro that maps to an inline function which extracts the sign bit from the raw bit representation of the floating-point operand. On GPUs this is easily doable since integer and floating-point operands share the same register file. In the case of CUDA 6.5, the sign bit is extracted with a single right-shift instruction.
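An illustrative equivalent of that approach (not the actual header code) might look like:

// Sketch only: extract the sign bit of a float by reinterpreting its bits as
// an integer and shifting the sign bit down to position 0.
__device__ __forceinline__ int my_signbit(float x)
{
    return (int)(((unsigned int)__float_as_int(x)) >> 31);
}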
So the implementation of signbit() is branchless and efficient; however, there is no dedicated hardware instruction for it, as none is necessary.
In general, CUDA programmers do not need to worry about branches all that often, especially where if-then-else constructs with small bodies are concerned. The compiler frequently renders these into branchless code using either predication or select-type instructions (the machine equivalent of the C/C++ ternary operator). It may also combine uniform branches with predication.

Fermi architecture possible solution to my comparative study?

I am working on a comparative study in which I have to compare the serial and parallel versions of an algorithm (the NSGA-II algorithm, to be precise). NSGA-II is a heuristic optimization method and hence depends on the initial random population generated. If the initial populations generated using the CPU and the GPU are different, then I cannot make an impartial speedup study.
I have an NVIDIA Tesla C1060 card, which has compute capability 1.3. As per this answer and this NVIDIA document, we can't expect an sm_13 device to always yield IEEE-754 compliant single-precision (float) results. Which in other words means that, on my current device, I can't conduct an impartial speedup study of the CUDA program against its serial counterpart.
My question is: Would switching to Fermi architecture solve the problem?
Floating-point computations can yield different results on different architectures, regardless of whether the hardware is IEEE-754 compliant, since floating-point arithmetic is not associative and operations may be ordered or contracted differently. Even switching compilers on x86 will typically give different results. This whitepaper gives some excellent explanations.
Having said that, your real issue is that you have a data dependent algorithm where the operations are dependent on the random numbers you generate. So if you generate the same numbers on the CPU and the GPU then both runs will be following the same paths. Consider using cuRAND, which can generate the same numbers on both the CPU and GPU.
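A sketch of how that might be done with cuRAND's host and device generators (the generator choice, seed, and function name below are illustrative assumptions):

#include <curand.h>

// Sketch: produce the same uniform sequence on the host and on the device by
// using cuRAND generators with an identical algorithm and seed.
void generate_matching_sequences(float *d_out, float *h_out, size_t n)
{
    curandGenerator_t gen_dev, gen_host;

    curandCreateGenerator(&gen_dev, CURAND_RNG_PSEUDO_MRG32K3A);       // device generator
    curandSetPseudoRandomGeneratorSeed(gen_dev, 1234ULL);
    curandGenerateUniform(gen_dev, d_out, n);                          // fills device memory

    curandCreateGeneratorHost(&gen_host, CURAND_RNG_PSEUDO_MRG32K3A);  // host generator
    curandSetPseudoRandomGeneratorSeed(gen_host, 1234ULL);
    curandGenerateUniform(gen_host, h_out, n);                         // fills host memory

    curandDestroyGenerator(gen_dev);
    curandDestroyGenerator(gen_host);
}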

Special mathematical functions implemented in GPU hardware

I learnt today that NVIDIA GPUs have special hardware in the vertex unit for calculating linear interpolation on a 3D regular grid. I wonder if there are more functions of this kind and, more importantly, whether people really use them when doing GPGPU to accelerate code.
There are a number of functions that are implemented in hardware. The term you're looking for is "CUDA intrinsic functions." The linear interpolation is handled by textures, which is something similar.
See here: http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf
The intrinsic functions are typically spelled with leading double underscores, like __sinf, or the compiler can be told to substitute them globally with the --use_fast_math nvcc option.
And yes, they're actually used quite often. :) They are slightly less accurate from a numerical perspective, so repeatedly passing the result of one intrinsic into another can accumulate unacceptable error, depending on your use case. Testing is key.
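As a concrete (illustrative) example of the trade-off, assuming a simple elementwise kernel:

// Sketch: the standard device function next to its fast SFU-based intrinsic.
__global__ void trig_kernel(const float *in, float *out_accurate, float *out_fast, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out_accurate[i] = sinf(in[i]);   // standard single-precision sine
        out_fast[i]     = __sinf(in[i]); // intrinsic: faster, reduced accuracy
    }
}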