CUDA function call from another cu file - cuda

I have two CUDA files, say A and B.
I need to call a function defined in A from B, like this:
__device__ int add(int a, int b) // this is a function in A
{
    return a + b;
}
__device__ void fun1(int a, int b) // this is a function in B
{
    int c = A.add(a, b);
}
How can I do this?
Can I use the static keyword? Please give me an example.

The short answer is that you can't. CUDA only supports internal linkage, thus everything needed to compile a kernel must be defined within the same translation unit.
What you might be able to do is put the functions into a header file like this:
// Both functions in func.cuh
#pragma once

__device__ inline int add(int a, int b)
{
    return a + b;
}

__device__ inline void fun1(int a, int b)
{
    int c = add(a, b);
}
and include that header file in each .cu file where you need to use the functions. The CUDA build chain seems to honour the inline keyword, and that sort of declaration won't generate duplicate symbols on any of the CUDA platforms I use (which doesn't include Windows). I am not sure whether it is intended to work or not, so caveat emptor.

Meanwhile there is a way to solve this; see:
CUDA external class linkage and unresolved extern function in ptxas file
You can enable "Generate Relocatable Device Code" in VS Project Properties -> CUDA C/C++ -> Common, or use the compiler parameter -rdc=true.

Related

Why is CUDA function cudaLaunchKernel passed a function pointer to host-code function?

I compile axpy.cu with the following command.
nvcc --cuda axpy.cu -o axpy.cu.cpp.ii
Within axpy.cu.cpp.ii, I observe that the function cudaLaunchKernel, nested in __device_stub__Z4axpyfPfS_, accepts a function pointer to axpy, which is defined in axpy.cu.cpp.ii. My confusion is: shouldn't cudaLaunchKernel have been passed a function pointer to the kernel function axpy? Why is there a function definition with the same name as the kernel function? Any help would be appreciated! Thanks in advance.
Both functions are shown below.
void __device_stub__Z4axpyfPfS_(float __par0, float *__par1, float *__par2){
    void * __args_arr[3];
    int __args_idx = 0;
    __args_arr[__args_idx] = (void *)(char *)&__par0;
    ++__args_idx;
    __args_arr[__args_idx] = (void *)(char *)&__par1;
    ++__args_idx;
    __args_arr[__args_idx] = (void *)(char *)&__par2;
    ++__args_idx;
    {
        volatile static char *__f __attribute__((unused));
        __f = ((char *)((void ( *)(float, float *, float *))axpy));
        dim3 __gridDim, __blockDim;
        size_t __sharedMem;
        cudaStream_t __stream;
        if (__cudaPopCallConfiguration(&__gridDim, &__blockDim, &__sharedMem, &__stream) != cudaSuccess)
            return;
        if (__args_idx == 0) {
            (void)cudaLaunchKernel(((char *)((void ( *)(float, float *, float *))axpy)), __gridDim, __blockDim, &__args_arr[__args_idx], __sharedMem, __stream);
        } else {
            (void)cudaLaunchKernel(((char *)((void ( *)(float, float *, float *))axpy)), __gridDim, __blockDim, &__args_arr[0], __sharedMem, __stream);
        }
    };
}
void axpy( float __cuda_0,float *__cuda_1,float *__cuda_2)
# 3 "axpy.cu"
{
    __device_stub__Z4axpyfPfS_( __cuda_0,__cuda_1,__cuda_2);
}
Shouldn't cudaLaunchKernel have been passed a function pointer to the kernel function axpy?
No, because that isn't the design which NVIDIA chose, and your assumptions about how this function works are probably not correct. As I understand it, the first argument to cudaLaunchKernel is treated as a key, not as a function pointer which is called. Elsewhere in the nvcc-emitted code you will find something like this boilerplate:
static void __nv_cudaEntityRegisterCallback( void **__T0)
{
    __nv_dummy_param_ref(__T0);
    __nv_save_fatbinhandle_for_managed_rt(__T0);
    __cudaRegisterEntry(__T0, ((void ( *)(float, float *, float *))axpy), _Z4axpyfPfS_, (-1));
}
You can see that the __cudaRegisterEntry call takes both a static pointer to axpy and a form of the mangled symbol for the compiled GPU kernel. __cudaRegisterEntry is an internal, completely undocumented API within the CUDA runtime API. Many years ago, by reverse engineering an earlier version of the CUDA runtime API, I satisfied myself that there is an internal lookup mechanism which allows the correct instance of a host kernel stub to be matched to the correct instance of the compiled GPU code at runtime. The compiler emits a large amount of boilerplate and statically defined objects holding all the necessary definitions to make the runtime API work seamlessly, without all of the additional API overhead that you need in the CUDA driver API or in comparable compute APIs like OpenCL.
Why is there function definition with the same name as kernel function?
Because that is how NVIDIA decided to implement the runtime API. What you are asking about are undocumented, internal implementation details of the runtime API. You as a programmer are not supposed to have to see or use them, and they are not guaranteed to be the same from version to version.
If, as indicated in comments, you want to implement some additional compiler infrastructure in the CUDA compilation process, use the CUDA driver API, not the runtime API.
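For contrast, here is a minimal driver API sketch (error checking omitted; the module file axpy.ptx and the mangled symbol _Z4axpyfPfS_ are assumed to come from something like nvcc -ptx axpy.cu). The kernel is loaded and launched explicitly through a handle, with none of the runtime API's hidden registration boilerplate:
#include <cuda.h>

int main()
{
    // Explicit initialisation and context creation, which the runtime API hides.
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);

    // Load the compiled GPU code and look the kernel up by its mangled name.
    CUmodule mod;
    cuModuleLoad(&mod, "axpy.ptx");
    CUfunction kernel;
    cuModuleGetFunction(&kernel, mod, "_Z4axpyfPfS_");

    // Set up arguments and launch; no host-side stub function is involved.
    float a = 2.0f;
    CUdeviceptr x, y;
    cuMemAlloc(&x, sizeof(float));
    cuMemAlloc(&y, sizeof(float));
    void *args[] = { &a, &x, &y };
    cuLaunchKernel(kernel, 1, 1, 1, 1, 1, 1, 0, 0, args, 0);
    cuCtxSynchronize();

    cuMemFree(x);
    cuMemFree(y);
    cuCtxDestroy(ctx);
    return 0;
}
// Link against the driver library, e.g. with -lcuda.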

Does CUDA C++ not have tuples in device code?

__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    auto lamb = [](int x) { return x + 1; }; // Works.
    auto t = std::make_tuple(1, 2, 3);       // Does not work.
    c[i] = a[i] + b[i];
}
NVCC has lambdas at least, but std::make_tuple fails to compile. Are tuples not allowed in the current version of Cuda?
I've just tried this out, and tuple metaprogramming with std:: (std::tuple, std::get, etc.) will work in device code with C++14 and --expt-relaxed-constexpr enabled during compilation (CUDA 8+), e.g. nvcc -std=c++14 xxxx.cu -o yyyyy --expt-relaxed-constexpr. CUDA 9 is required for C++14, but basic std::tuple should work in CUDA 8 if you are limited to that. thrust/tuple works but has some drawbacks: it is limited to 10 items and lacks some of the std::tuple helper functions (e.g. std::tuple_cat). Because tuples and their related functions are compile-time constructs, --expt-relaxed-constexpr should enable your std::tuple to "just work".
#include <cstdio>
#include <tuple>

__global__ void kernel()
{
    auto t = std::make_tuple(1, 2, 3);
    printf("%d\n", std::get<0>(t));
}

int main()
{
    kernel<<<1,1>>>();
    cudaDeviceSynchronize();
}
#include <thrust/tuple.h>

__global__ void addKernel(int *c, const int *a, const int *b)
{
    int i = threadIdx.x;
    auto lamb = [](int x) { return x + 1; }; // Works.
    auto t = thrust::make_tuple(1, 2, 3);
    c[i] = a[i] + b[i];
}
It seems I needed to use the ones from the Thrust library instead to make them work. The above does compile.
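For completeness, elements are read back out of a thrust::tuple with thrust::get, mirroring std::get; a small sketch:
#include <thrust/tuple.h>
#include <cstdio>

__global__ void kernel()
{
    auto t = thrust::make_tuple(1, 2, 3);
    // thrust::get is the device-side counterpart of std::get.
    printf("%d %d %d\n", thrust::get<0>(t), thrust::get<1>(t), thrust::get<2>(t));
}

int main()
{
    kernel<<<1, 1>>>();
    cudaDeviceSynchronize();
}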
Support for the standard C++ library on the device side is problematic for CUDA, as the standard library does not carry the necessary __host__ or __device__ annotations.
That said, both clang and nvcc have partial support for some functionality. Usually it is limited to constexpr functions, which are considered __host__ __device__ if you pass --expt-relaxed-constexpr to nvcc (or by default in clang). Clang also has a bit more support for standard math functions. Neither supports anything that relies on the C++ runtime (except for memory allocation, printf and assert), as that does not exist on the device side.
So, in short, most of the standard C++ library is unusable on the device side in CUDA, though things do slowly improve as more and more functions in the standard library become constexpr.
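To illustrate the constexpr relaxation mentioned above, here is a small sketch (assuming something like nvcc -std=c++14 --expt-relaxed-constexpr): std::max is constexpr as of C++14, so with the flag device code may call it even though <algorithm> carries no __device__ annotations.
#include <algorithm>
#include <cstdio>

__global__ void kernel()
{
    // A constexpr standard library function, callable from device code
    // under --expt-relaxed-constexpr.
    printf("%d\n", std::max(3, 7));
}

int main()
{
    kernel<<<1, 1>>>();
    cudaDeviceSynchronize();
}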
Indeed, CUDA itself does not offer a device-side-capable version of std::tuple. However, I have a full tuple implementation as part of my cuda-kat library (still very much under initial development at the time of writing). thrust's tuple class is limited in the following senses:
Limited to 10 tuple elements.
Recursively expands templated types for every tuple element.
No/partial support for rvalues (e.g. in get())
The tuple implementation in cuda-kat is an adaptation of the EASTL tuple, which is in turn an adaptation of the LLVM project's libc++ tuple. Unlike EASTL's, however, it is C++11-compatible, so you don't have to have the absolute latest CUDA version. If you need just the tuple class, it is possible to extract it from the library with, I think, about 4 files.

Load device function from shared library with dlopen

I'm relatively new to CUDA programming and can't find a solution to my problem.
I'm trying to have a shared library, let's call it func.so, that defines a device function
__device__ void hello(){ printf("hello"); }
I then want to be able to access that library via dlopen, and use that function in my program. I tried something along the following lines:
func.cu
#include <stdio.h>

typedef void (*pFCN)();

__device__ void dhello(){
    printf("hello\n");
}

__device__ pFCN ptest = dhello;
pFCN h_pFCN;

extern "C" pFCN getpointer(){
    cudaMemcpyFromSymbol(&h_pFCN, ptest, sizeof(pFCN));
    return h_pFCN;
}
main.cu
#include <dlfcn.h>
#include <stdio.h>

typedef void (*fcn)();
typedef fcn (*retpt)();

retpt hfcnpt;
fcn hfcn;
__device__ fcn dfcn;

__global__ void foo(){
    (*dfcn)();
}

int main() {
    void * m_handle = dlopen("gputest.so", RTLD_NOW);
    hfcnpt = (retpt) dlsym( m_handle, "getpointer");
    hfcn = (*hfcnpt)();
    cudaMemcpyToSymbol(dfcn, &hfcn, sizeof(fcn), 0, cudaMemcpyHostToDevice);
    foo<<<1,1>>>();
    cudaThreadSynchronize();
    return 0;
}
But this way I get the following error when debugging with cuda-gdb:
CUDA Exception: Warp Illegal Instruction
Program received signal CUDA_EXCEPTION_4, Warp Illegal Instruction.
0x0000000000806b30 in dtest () at func.cu:5
I appreciate any help you all can give me! :)
Calling a __device__ function in one compilation unit from device code in another compilation unit requires nvcc's separate compilation with device linking.
However, such usage with libraries only works with static libraries.
Therefore if the target __device__ function is in the .so library, and the calling code is outside of the .so library, your approach cannot work, with the current nvcc toolchain.
The only "workarounds" I can suggest would be to put the desired target function in a static library, or else put both caller and target inside the same .so library. There are a number of questions/answers on the cuda tag which give examples of these alternate approaches.

A function calls another function in CUDA C++

I have a problem with CUDA programming!
Input is a matrix A (2 x 2).
Output is a matrix A (2 x 2) where every new value is the old value raised to the third power.
Example:
input:  A = { 2, 2 }      output:  A = { 8, 8 }
            { 2, 2 }                   { 8, 8 }
I have 2 functions in the file CudaCode.CU:
__global__ void Power_of_02(int &a)
{
    a = a*a;
}
//***************
__global__ void Power_of_03(int &a)
{
    int tempt = a;
    Power_of_02(a); // a = a^2
    a = a*tempt;    // a = a^3
}
and the kernel:
__global__ void CudaProcessingKernel(int *dataA) // kernel function
{
    int bx = blockIdx.x;
    int tx = threadIdx.x;
    int tid = bx * XTHREADS + tx;
    if(tid < 16)
    {
        Power_of_03(dataA[tid]);
    }
    __syncthreads();
}
I think it's right, but this error appears: calling a __global__ function("Power_of_02") from a __global__ function("Power_of_03") is only allowed on the compute_35 architecture or above
Why am I wrong? How can I fix it?
The error is fairly explanatory. A CUDA function decorated with __global__ represents a kernel. Kernels can be launched from host code. On cc 3.5 or higher GPUs, you can also launch a kernel from device code. So if you call a __global__ function from device code (i.e. from another CUDA function that is decorated with __global__ or __device__), then you must be compiling for the appropriate architecture. This is called CUDA dynamic parallelism, and you should read the documentation to learn how to use it, if you want to use it.
When you launch a kernel, whether from host or device code, you must provide a launch configuration, i.e. the information between the triple-chevron notation:
CudaProcessingKernel<<<grid, threads>>>(d_A);
If you want to use your power-of-2 code from another kernel, you will need to call it in a similar, appropriate fashion.
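For illustration, a minimal sketch of such a device-side launch (dynamic parallelism), assuming compilation with something like nvcc -arch=sm_35 -rdc=true; the sketch passes a pointer rather than a reference and uses blockDim.x in place of the question's XTHREADS:
__global__ void Power_of_02(int *a)
{
    *a = (*a) * (*a);
}

__global__ void CudaProcessingKernel(int *dataA)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < 16)
    {
        // Even when launched from device code, a kernel needs its own
        // launch configuration between the triple chevrons.
        Power_of_02<<<1, 1>>>(&dataA[tid]);
    }
}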
Based on the structure of your code, however, it seems like you can make things work by declaring your power-of-2 and power-of-3 functions as __device__ functions:
__device__ void Power_of_02(int &a)
{
    a = a*a;
}
//***************
__device__ void Power_of_03(int &a)
{
    int tempt = a;
    Power_of_02(a); // a = a^2
    a = a*tempt;    // a = a^3
}
This should probably work for you and perhaps was your intent. Functions decorated with __device__ are not kernels (so they are not callable directly from host code) but are callable directly from device code on any architecture. The programming guide will also help to explain the difference.

Function with same signature

I would like to have two versions of the same member function of a class, one on the host side and one on the device side.
Let's say:
class A {
public:
    double stdinvcdf(float x) {
        static normal boostnormal(0, 1);
        return boost::math::cdf(boostnormal, x);
    }
    __device__ double stdinvcdf(float x) {
        return normcdfinvf(x);
    }
};
But when I compile this code with nvcc, it aborts with a function redefinition error.
CUDA/C++ does not support this kind of function overloading, because the __host__ and __device__ qualifiers are not part of the function signature, so in the end the two definitions have the same signature. The common approach to having both a host and a device version is to use __host__ in combination with __device__, together with an #ifdef, e.g.
__host__ __device__ double stdinvcdf(float x)
{
#ifdef __CUDA_ARCH__
    /* DEVICE CODE */
#else
    /* HOST CODE */
#endif
}
This solution was also discussed in this thread in the NVIDIA developer forum.
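Applied to the class from the question, a sketch might look as follows (keeping the question's boost call on the host path and normcdfinvf on the device path; the boost header must of course still be digestible by nvcc's host compilation pass):
#include <boost/math/distributions/normal.hpp>

class A {
public:
    __host__ __device__ double stdinvcdf(float x)
    {
#ifdef __CUDA_ARCH__
        // Device path, as in the question.
        return normcdfinvf(x);
#else
        // Host path, as in the question.
        static boost::math::normal boostnormal(0, 1);
        return boost::math::cdf(boostnormal, x);
#endif
    }
};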