__host__ __device__ functions calling overloaded functions - cuda

I do not understand whether CUDA supports function overloading or not. I will illustrate my problem with the following two functions, which I want to be able to use on both the GPU and the CPU, and for which I don't care about precision:
__host__ __device__
float myabs( float v ) {
    return abs( v + 1 ); // I want the floating point absolute value
}
__host__ __device__
float mycos( float v ) {
    return 2.f*cos( v );
}
Which function of abs, resp. cos should I call, and why?
std::abs / abs / fabs / fabsf / anythingelse
std::cos / cos / cosf / __cosf / anythingelse
(Since __cosf is a CUDA intrinsic and std::abs/std::cos are not available in CUDA, I assume I have to use preprocessor directives inside my functions for those choices.)
Which headers should I include?
Does the answer to the first two questions depend on whether I compile with fast-math flags (e.g. -ffast-math)?
In case it matters for the answer, I am compiling with nvcc 10.2 under Ubuntu 18.04.4, but I am more interested in a platform-independent answer.

Which function of abs, resp. cos should I call, and why?
If you are using float arguments, then conventionally you would use fabsf and cosf. These are the single-precision functions of the CUDA Math API, and they share their names with the equivalent C standard library functions, so the same call compiles for both host and device.
Which headers should I include?
Conventionally you should include either math.h or cmath.
Does the answer to the first two questions depend on whether I compile with fast-math flags (e.g. -ffast-math)?
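Only for cosf. nvcc's fast-math option is --use_fast_math, and when device code is compiled with it, calls to cosf are replaced by the faster, less accurate __cosf intrinsic. fabsf is not among the functions that flag replaces. The substitution applies to device code only.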
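To make that concrete, here is a minimal sketch of the two functions from the question written with fabsf and cosf. The HD macro is my own assumption, not something from the question: it expands to __host__ __device__ under nvcc and to nothing under an ordinary C++ compiler, so the same source builds either way.

```cpp
#include <math.h>   // declares fabsf and cosf in the global namespace

// HD is a hypothetical helper macro for this sketch: it marks a function as
// callable from both host and device under nvcc, and expands to nothing when
// the file is compiled by a plain C++ compiler.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

HD inline float myabs(float v) {
    return fabsf(v + 1.0f);   // single-precision absolute value
}

HD inline float mycos(float v) {
    return 2.0f * cosf(v);    // single-precision cosine
}
```

No preprocessor branching is needed inside the function bodies, because both names resolve on the host and on the device.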

Related

Elementwise vector Multiplication in cublas [duplicate]

I need to compute the element-wise multiplication of two vectors (Hadamard product) of complex numbers with NVIDIA CUBLAS. Unfortunately, there is no HAD operation in CUBLAS. Apparently, you can do this with the SBMV operation, but it is not implemented for complex numbers in CUBLAS. I cannot believe there is no way to achieve this with CUBLAS. Is there any other way to achieve that with CUBLAS, for complex numbers?
I cannot write my own kernel, I have to use CUBLAS (or another standard NVIDIA library if it is really not possible with CUBLAS).
CUBLAS is based on the reference BLAS, and the reference BLAS has never contained a Hadamard product (complex or real). Hence CUBLAS doesn't have one either. Intel have added v?Mul to MKL for doing this, but it is non-standard and not in most BLAS implementations. It is the kind of operation that an old-school Fortran programmer would just write a loop for, so I presume it really didn't warrant a dedicated routine in BLAS.
There is no "standard" CUDA library I am aware of which implements a Hadamard product. There would be the possibility of using CUBLAS GEMM or SYMM to do this and extracting the diagonal of the resulting matrix, but that would be horribly inefficient, both from a computation and a storage standpoint.
The Thrust template library can do this trivially using thrust::transform, for example:
thrust::multiplies<thrust::complex<float> > op;
thrust::transform(thrust::device, x, x + n, y, z, op);
would iterate over each pair of inputs from the device pointers x and y and calculate z[i] = x[i] * y[i] (you will probably need a couple of casts to make that compile, but you get the idea). But that effectively requires compiling CUDA code within your project, and apparently you don't want that.
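A slightly fuller sketch of that transform call, with the headers it needs. The helper name hadamard and the cfloat typedef are my own assumptions. The non-CUDA branch exists only so the function can be tried on a machine without a GPU toolchain; under nvcc the Thrust branch is the one that gets compiled.

```cpp
#include <cstddef>

#ifdef __CUDACC__
#include <thrust/complex.h>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <thrust/transform.h>
typedef thrust::complex<float> cfloat;
#else
#include <algorithm>
#include <complex>
#include <functional>
typedef std::complex<float> cfloat;
#endif

// hadamard is a hypothetical helper name. Under nvcc, x, y and z must be
// device pointers to n elements each; under a plain C++ compiler they are
// ordinary host pointers.
inline void hadamard(const cfloat* x, const cfloat* y, cfloat* z, std::size_t n) {
#ifdef __CUDACC__
    // Element-wise product on the device: z[i] = x[i] * y[i]
    thrust::transform(thrust::device, x, x + n, y, z, thrust::multiplies<cfloat>());
#else
    // Host fallback with the same semantics
    std::transform(x, x + n, y, z, std::multiplies<cfloat>());
#endif
}
```

If the buffers are declared as cuComplex because they are shared with CUBLAS calls, the pointers would need a reinterpret_cast to thrust::complex<float>*. That relies on the two types having the same memory layout, which I believe is the case but have not verified here.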

sprintf-like function for CUDA device-side code?

I could not find anything on the internet. Since printf can be used in a __device__ function, I am wondering whether there is also a sprintf-like function, given that printf presumably uses the result of sprintf to write to stdout.
No, there isn't anything built into CUDA for this.
Within CUDA the implementation of device printf is a special case and does not use the same mechanisms as the C library printf.
sprintf(), snprintf() and additional printf()-family functions are now available on the development branch of the CUDA Kernel Author's Toolkit, a.k.a. cuda-kat. Signatures:
namespace kat {
    __device__ int sprintf(char* s, const char* format, ...);
    __device__ int snprintf(char* s, size_t n, const char* format, ...);
}
... and they do exactly what you would expect. In particular, they support the C standard features which CUDA printf() does not, and then some (e.g. specifying a string argument's field width using an extra argument; format specifiers for size_t and ptrdiff_t; and printing in base 2).
Caveat: I am the author of cuda-kat, so I'm biased...
Always prefer snprintf(), which takes the buffer size, over sprintf(), which might overflow.
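To illustrate the buffer-size point without depending on any library, here is a minimal sketch of a bounded unsigned-integer formatter that can run in device code. It is not cuda-kat's implementation and not a substitute for snprintf; the name bounded_utoa and the HD macro are assumptions of this sketch, and it assumes a 32-bit unsigned int.

```cpp
#include <cstddef>

// HD is a hypothetical helper macro: host+device under nvcc, nothing otherwise.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Writes the decimal form of value into buf without touching more than n bytes.
// Like snprintf, it NUL-terminates whenever n > 0 and returns the number of
// characters the full result needs, so a return value >= n signals truncation.
HD inline int bounded_utoa(char* buf, std::size_t n, unsigned int value) {
    char digits[10];                 // assumes 32-bit unsigned: at most 10 decimal digits
    int len = 0;
    do {                             // produce the digits in reverse order
        digits[len++] = static_cast<char>('0' + value % 10u);
        value /= 10u;
    } while (value != 0u);

    if (n > 0) {
        std::size_t out = 0;
        for (int i = len - 1; i >= 0 && out + 1 < n; --i) {
            buf[out++] = digits[i];  // copy the most significant digit first
        }
        buf[out] = '\0';
    }
    return len;
}
```

The caller passes the real size of its buffer, and the function truncates instead of writing past the end, which is the property that makes snprintf preferable to sprintf.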

using memcmp from within device code CUDA [duplicate]

I want to call a function such as strcmp on the GPU, but I get:
error: calling a host function("strcmp") from a __device__/__global__ function("myKernel") is not allowed
I can accept that printf might not work because the GPU has no stdout, but I would expect functions like strcmp to work. Should I copy the library implementation of strcmp into my code with a __device__ prefix, or what?
CUDA has a standard library, documented in the CUDA programming guide. It includes printf() for devices that support it (Compute Capability 2.0 and higher), as well as assert(). It does not include a complete string or stdio library at this point, however.
Implementing your own standard library as Jason R. Mick suggests may be possible, but it is not necessarily advisable. In some cases, it may be unsafe to naively port functions from the sequential standard library to CUDA -- not least because some of these implementations are not meant to be thread safe (rand() on Windows, for example). Even if it is safe, it might not be efficient -- and it might not really be what you need.
In my opinion, you are better off avoiding standard library functions in CUDA that are not officially supported. If you need the behavior of a standard library function in your parallel code, first consider whether you really need it:
* Are you really going to do thousands of strcmp operations in parallel?
* If not, do you have strings to compare that are many thousands of characters long? If so, consider a parallel string comparison algorithm instead.
If you determine that you really do need the behavior of the standard library function in your parallel CUDA code, then consider how you might implement it (safely and efficiently) in parallel.
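Hope this will help at least one person:
Since the strcmp function is not available in CUDA device code, we have to implement it ourselves:
__device__ int my_strcmp(const char *s1, const char *s2) {
    // Advance while the characters match; equal strings end at the shared NUL.
    for (; *s1 == *s2; ++s1, ++s2)
        if (*s1 == 0)
            return 0;
    // First mismatch: compare as unsigned char, as strcmp does.
    return *(const unsigned char *)s1 < *(const unsigned char *)s2 ? -1 : 1;
}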
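As a sketch of how such a helper might be used for many comparisons at once, the following compares one pair of strings per thread. The names hd_strcmp, compare_pair and compare_all are hypothetical, as is the HD macro. The storage layout is also an assumption: strings stored back to back, each in a slot of width bytes that includes its terminating NUL.

```cpp
// HD is a hypothetical helper macro: host+device under nvcc, nothing otherwise.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Same contract as strcmp: negative, zero or positive result.
HD inline int hd_strcmp(const char* s1, const char* s2) {
    while (*s1 != '\0' && *s1 == *s2) {
        ++s1;
        ++s2;
    }
    return static_cast<int>(static_cast<unsigned char>(*s1)) -
           static_cast<int>(static_cast<unsigned char>(*s2));
}

// Compares the i-th string of a with the i-th string of b.
// Assumed layout: string i starts at offset i * width and is NUL-terminated.
HD inline int compare_pair(const char* a, const char* b, int width, int i) {
    return hd_strcmp(a + i * width, b + i * width);
}

#ifdef __CUDACC__
// One thread per pair; a, b and result must be device pointers.
__global__ void compare_all(const char* a, const char* b, int width, int count, int* result) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        result[i] = compare_pair(a, b, width, i);
    }
}
#endif
```

A launch would look something like compare_all<<<(count + 255) / 256, 256>>>(d_a, d_b, width, count, d_result), where d_a, d_b and d_result are placeholder names for buffers already allocated and filled on the device.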
