Access to CUDA library functions inside specialized instantiations of __device__ function templates - cuda

I have the following template __device__ function in CUDA:
template<typename T>
__device__ void MyatomicAdd(T *address, T val){
atomicAdd(address , val);
}
that compiles and runs just fine if instantiated with T as a float, i.e.
__global__ void myKernel(float *a, float b){
MyatomicAdd<float>(a,b);
}
will run without a problem.
I wanted to specialize this function, as there is no atomicAdd() for doubles, so I can hand code an implementation in double precision. Ignoring the double precision specialization for now, the single precision specialization and template look like this:
template<typename T>
__device__ void MyatomicAdd(T *address, T val){
};
template<>
__device__ void MyatomicAdd<float>(float *address, float val){
atomicAdd(address , val);
}
Now the compiler complains that atomicAdd() is undefined in my specialization; the same applies when I try to use any CUDA functions like __syncthreads() within the specialization. Any ideas? Thanks.

It ended up being a linking problem with some OpenGL code developed by a colleague. Forcing the specializations to be inline fixed the problem, although obviously not the root cause. Still, it'll do for now until I can be bothered to dig through the other guy's code.
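For completeness, the double-precision specialization alluded to in the question is usually hand-coded with atomicCAS, along the lines of the well-known example in the CUDA C Programming Guide. A sketch, matching the void-returning MyatomicAdd template above:
template<>
__device__ void MyatomicAdd<double>(double *address, double val){
    // Reinterpret the double as a 64-bit integer so atomicCAS can operate on it
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old); // retry if another thread changed the value meanwhile
}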

Related

Why can't we split __host__ and __device__ implementations?

If we have a __host__ __device__ function in CUDA, we can use macros to choose different code paths for host-side and device-side code in its implementations, like so:
__host__ __device__ int foo(int x)
{
#ifdef __CUDA_ARCH__
return x * 2;
#else
return x;
#endif
}
but why is it that we can't write:
__host__ __device__ int foo(int x);
__device__ int foo(int x) { return x * 2; }
__host__ int foo(int x) { return x; }
instead?
The Clang implementation of CUDA C++ actually supports overloading on __host__ and
__device__ because it considers the execution space qualifiers part of the function signature. Note, however, that even there, you'd have to declare the two functions separately:
__device__ int foo(int x);
__host__ int foo(int x);
__device__ int foo(int x) { return x * 2; }
__host__ int foo(int x) { return x; }
Personally, I'm not sure how desirable/important that really is, though. Consider that you can just define a foo(int x) in the host code outside of your CUDA source. If someone told me they need different implementations of the same function for host and device, where the host version for some reason has to be defined as part of the CUDA source, my initial gut feeling would be that something is likely going in a bit of an odd direction. If the host version does something different, shouldn't it most likely have a different name? If it logically does the same thing, just not using the GPU, then why does it have to be part of the CUDA source? I'd generally advocate for keeping as clean and strict a separation between host and device code as possible, and keeping any host code inside the CUDA source to the bare minimum. Even if you don't care about the cleanliness of your code, doing so will at least minimize the chances of getting hurt by all the compiler magic that goes on under the hood…
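To illustrate the "define foo(int x) in the host code outside of your CUDA source" suggestion, here is a minimal sketch (file names are assumed): the __device__ version lives in the .cu file and is only visible to device code, while the host version is compiled by the host compiler in a plain .cpp file, so the two definitions never meet.
// kernels.cu - device-only foo, used by kernels
__device__ int foo(int x) { return x * 2; }

__global__ void twice(int *out, int x) { *out = foo(x); }

// host_foo.cpp - host-only foo, compiled by the host compiler
int foo(int x) { return x; }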

Load device function from shared library with dlopen

I'm relatively new to CUDA programming and can't find a solution to my problem.
I'm trying to have a shared library, let's call it func.so, that defines a device function
__device__ void hello(){ printf("hello"); }
I then want to be able to access that library via dlopen, and use that function in my program. I tried something along the following lines:
func.cu
#include <stdio.h>
typedef void(*pFCN)();
__device__ void dhello(){
printf("hello\n");
}
__device__ pFCN ptest = dhello;
pFCN h_pFCN;
extern "C" pFCN getpointer(){
cudaMemcpyFromSymbol(&h_pFCN, ptest, sizeof(pFCN));
return h_pFCN;
}
main.cu
#include <dlfcn.h>
#include <stdio.h>
typedef void (*fcn)();
typedef fcn (*retpt)();
retpt hfcnpt;
fcn hfcn;
__device__ fcn dfcn;
__global__ void foo(){
(*dfcn)();
}
int main() {
void * m_handle = dlopen("gputest.so", RTLD_NOW);
hfcnpt = (retpt) dlsym( m_handle, "getpointer");
hfcn = (*hfcnpt)();
cudaMemcpyToSymbol(dfcn, &hfcn, sizeof(fcn), 0, cudaMemcpyHostToDevice);
foo<<<1,1>>>();
cudaThreadSynchronize();
return 0;
}
But this way I get the following error when debugging with cuda-gdb:
CUDA Exception: Warp Illegal Instruction
Program received signal CUDA_EXCEPTION_4, Warp Illegal Instruction.
0x0000000000806b30 in dtest () at func.cu:5
I appreciate any help you all can give me! :)
Calling a __device__ function in one compilation unit from device code in another compilation unit requires nvcc's separate compilation with device linking (relocatable device code).
However, such usage with libraries only works with static libraries.
Therefore if the target __device__ function is in the .so library, and the calling code is outside of the .so library, your approach cannot work, with the current nvcc toolchain.
The only "workarounds" I can suggest would be to put the desired target function in a static library, or else put both caller and target inside the same .so library. There are a number of questions/answers on the cuda tag which give examples of these alternate approaches.
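As a rough sketch of the static-library alternative (file names and build commands are assumptions, not taken from the question): compile both sides with relocatable device code, archive the callee, and let nvcc perform the device link in the final step.
// func.cu (goes into the static library)
#include <cstdio>
__device__ void dhello() { printf("hello\n"); }

// main.cu (links against the static library)
extern __device__ void dhello();
__global__ void foo() { dhello(); }
int main() { foo<<<1,1>>>(); cudaDeviceSynchronize(); return 0; }

// Build (sketch):
//   nvcc -dc func.cu -o func.o                // -dc compiles with relocatable device code
//   nvcc -lib func.o -o libfunc.a
//   nvcc -rdc=true main.cu libfunc.a -o app   // nvcc performs the device link here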

Function with same signature

I would like to have two versions of the same member function of a class, one for the host and one for the device side.
Let's say
class A {
public:
double stdinvcdf(float x) {
static normal boostnormal(0, 1);
return boost::math::cdf(boostnormal,x);
}
__device__ double stdinvcdf(float x) {
return normcdfinvf(x);
}
};
But when I compile this code using nvcc, it aborts with a function redefinition error.
CUDA C++ does not support this kind of function overloading because, in the end, the two functions do not have different signatures. The common approach to having both a host and a device version is to combine __host__ and __device__ with an #ifdef, e.g.
__host__ __device__ double stdinvcdf(float x)
{
#ifdef __CUDA_ARCH__
/* DEVICE CODE */
#else
/* HOST CODE */
#endif
}
This solution was also discussed in this thread in the NVIDIA developer forum.
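Applied to the class from the question, a sketch (untested; it assumes Boost.Math on the host side and CUDA's normcdfinvf on the device side, as in the original post) could look like this:
#include <boost/math/distributions/normal.hpp>

class A {
public:
    __host__ __device__ double stdinvcdf(float x)
    {
#ifdef __CUDA_ARCH__
        return normcdfinvf(x);                    // device path: CUDA math API
#else
        static boost::math::normal boostnormal(0, 1);
        return boost::math::cdf(boostnormal, x);  // host path: Boost.Math, as in the question
#endif
    }
};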

Using std::vector in CUDA device code

The question is: is there a way to use the class "vector" in CUDA kernels? When I try, I get the following error:
error : calling a host function("std::vector<int, std::allocator<int> > ::push_back") from a __device__/__global__ function not allowed
So is there a way to use a vector in __global__ code?
I recently tried the following:
create a new CUDA project
go to properties of the project
open CUDA C/C++
go to Device
change the value in "Code Generation" to:
compute_20,sm_20
After that I was able to use the printf standard library function in my CUDA kernel.
Is there a way to use the standard library class vector in the way printf is supported in kernel code? This is an example of using printf in kernel code:
// this code only to count the 3s in an array using Cuda
//private_count is an array to hold every thread's result separately
__global__ void countKernel(int *a, int length, int* private_count)
{
printf("%d\n", threadIdx.x); // prints the thread id, and it's working
// vector<int> y;
//y.push_back(0); is there a possibility to do this?
unsigned int offset = threadIdx.x * length;
int i = offset;
for( ; i < offset + length; i++)
{
if(a[i] == 3)
{
private_count[threadIdx.x]++;
printf("%d ",a[i]);
}
}
}
You can't use the STL in CUDA, but you may be able to use the Thrust library to do what you want. Otherwise just copy the contents of the vector to the device and operate on it normally.
In the CUDA library Thrust, you can use thrust::device_vector<classT> to define a vector on the device, and the data transfer between a host STL vector and a device vector is very straightforward. You can refer to this useful link: http://docs.nvidia.com/cuda/thrust/index.html to find some useful examples.
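A minimal sketch of that workflow (counting the 3s from the question's example; names are illustrative):
#include <thrust/device_vector.h>
#include <thrust/count.h>
#include <vector>
#include <cstdio>

int main()
{
    std::vector<int> h = {3, 1, 3, 3, 7};
    thrust::device_vector<int> d(h.begin(), h.end());  // copies the host data to the GPU
    int threes = thrust::count(d.begin(), d.end(), 3); // the counting runs on the device
    printf("number of 3s: %d\n", threes);
    return 0;
}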
You can't use std::vector in device code; you should use a plain array instead.
I think you can implement a device vector yourself, because CUDA supports dynamic memory allocation in device code; operator new/delete are also supported. Here is an extremely simple prototype of a device vector in CUDA. It does work, but it hasn't been tested sufficiently.
template<typename T>
class LocalVector
{
private:
T* m_begin;
T* m_end;
size_t capacity;
size_t length;
__device__ void expand() {
capacity *= 2;
size_t tempLength = (m_end - m_begin);
T* tempBegin = new T[capacity];
memcpy(tempBegin, m_begin, tempLength * sizeof(T));
delete[] m_begin;
m_begin = tempBegin;
m_end = m_begin + tempLength;
length = static_cast<size_t>(m_end - m_begin);
}
public:
__device__ explicit LocalVector() : length(0), capacity(16) {
m_begin = new T[capacity];
m_end = m_begin;
}
__device__ T& operator[] (unsigned int index) {
return *(m_begin + index);//*(begin+index)
}
__device__ T* begin() {
return m_begin;
}
__device__ T* end() {
return m_end;
}
__device__ ~LocalVector()
{
delete[] m_begin;
m_begin = nullptr;
}
__device__ void add(T t) {
if ((size_t)(m_end - m_begin) >= capacity) {
expand();
}
*m_end = t; // slots were already default-constructed by new T[], so plain assignment is enough
m_end++;
length++;
}
__device__ T pop() {
// m_end points one past the last element, so step back before reading;
// individual elements of an array allocated with new[] must not be delete'd
m_end--;
length--;
return *m_end;
}
__device__ size_t getSize() {
return length;
}
};
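A quick usage sketch for the prototype above (the kernel name is made up). Note that device-side new draws from the device heap, whose default size can be raised with cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...) if the per-thread vectors grow large.
#include <cstdio>

__global__ void useLocalVector()
{
    LocalVector<int> v;            // per-thread vector in device heap memory
    for (int i = 0; i < 20; ++i)
        v.add(i);                  // grows past the initial capacity of 16
    printf("size=%u last=%d\n", (unsigned)v.getSize(), v[(unsigned)(v.getSize() - 1)]);
}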
You can't use std::vector in device-side code. Why?
It's not marked to allow this
The "formal" reason is that, to use code in your device-side function or kernel, that code itself has to be in a __device__ function; and the code in the standard library, including std::vector, is not. (There's an exception for constexpr code; and in C++20, std::vector does have constexpr methods, but CUDA does not support C++20 at the moment, plus, that constexpr-ness is effectively limited.)
You probably don't really want to
The std::vector class uses an allocator to obtain more memory when it needs to grow the storage for the vectors you create or add to. By default (i.e. if you use std::vector<T> for some T), that allocation is on the heap. While this could be adapted to the GPU, it would be quite slow, and incredibly slow if each "CUDA thread" dynamically allocated its own memory.
Now, you could say "But I don't want to allocate memory, I just want to read from the vector!" - well, in that case, you don't need a vector per se. Just copy the data to some on-device buffer, and either pass a pointer and a size, or use a CUDA-capable span, like in cuda-kat. Another option, though a bit "heavier", is to use the NVIDIA Thrust library's "device vector" class. Under the hood, it's quite different from the standard library vector though.
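As a sketch of the "copy the data to an on-device buffer and pass a pointer and a size" option (the kernel and names here are illustrative, not from the question):
#include <vector>
#include <cstdio>

__global__ void countThrees(const int *data, size_t n, int *count)
{
    for (size_t i = threadIdx.x; i < n; i += blockDim.x)
        if (data[i] == 3)
            atomicAdd(count, 1);
}

int main()
{
    std::vector<int> h = {3, 1, 3, 3, 7};
    int *d_data = nullptr, *d_count = nullptr;
    cudaMalloc(&d_data, h.size() * sizeof(int));
    cudaMalloc(&d_count, sizeof(int));
    cudaMemcpy(d_data, h.data(), h.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_count, 0, sizeof(int));
    countThrees<<<1, 128>>>(d_data, h.size(), d_count);
    int result = 0;
    cudaMemcpy(&result, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    printf("number of 3s: %d\n", result);
    cudaFree(d_data);
    cudaFree(d_count);
    return 0;
}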

CUDA function call from another cu file

I have two CUDA files, say A and B.
I need to call a function from A in B, like:
__device__ int add(int a, int b) //this is a function in A
{
return a+b;
}
__device__ void fun1(int a, int b) //this is a function in B
{
int c = A.add(a,b);
}
How can I do this? Can I use the static keyword? Please give me an example.
The short answer is that you can't. CUDA only supports internal linkage, thus everything needed to compile a kernel must be defined within the same translation unit.
What you might be able to do is put the functions into a header file like this:
// Both functions in func.cuh
#pragma once
__device__ inline int add(int a, int b)
{
return a+b;
}
__device__ inline void fun1(int a, int b)
{
int c = add(a,b);
}
and include that header file into each .cu file you need to use the functions in. The CUDA build chain seems to honour the inline keyword, and that sort of declaration won't generate duplicate symbols on any of the CUDA platforms I use (which don't include Windows). I am not sure whether it is intended to work or not, so caveat emptor.
I think there is meanwhile a possibility to solve it:
CUDA external class linkage and unresolved extern function in ptxas file
You can enable "Generate Relocatable Device Code" in VS Project Properties->CUDA C/C++->Common, or use the compiler parameter -rdc=true.
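With relocatable device code enabled, a rough sketch of the cross-file setup could look like this (file names and build lines are assumptions):
// A.cu
__device__ int add(int a, int b)
{
    return a + b;
}

// B.cu
extern __device__ int add(int a, int b); // declared here, defined in A.cu

__device__ void fun1(int a, int b)
{
    int c = add(a, b); // c unused, mirroring the question
}

// Build (sketch):
//   nvcc -dc A.cu B.cu                     // relocatable device code -> A.o, B.o
//   nvcc -rdc=true A.o B.o main.cu -o app  // nvcc performs the device link step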