cuModuleGetFunction doesn't accept simple kernel names, only ".entry" labels from the .ptx file - cuda

I convert my .cu files to PTX using CUDA_COMPILE_PTX from findPackageCUDA.cmake. When I try to get the function pointers to my kernels, I face the following problem:
My kernel named Kernel1 can only be loaded correctly via cuModuleGetFunction if I use its .entry label from the resulting .ptx file, e.g. _Z7Kernel1Pj
The problem is that this label may change each time I have to recompile my .cu files. That can't be a solution if I reference the kernels by name in a constant char*.

_Z7Kernel1Pj is a C++ mangled name. If you want a simple symbol, you can declare the kernel extern "C":
extern "C" __global__ void Kernel1(...)
For example, the default CUDA Visual Studio project contains the kernel
__global__ void addKernel(int *c, const int *a, const int *b)
If you run cuobjdump -symbols on this you will see the mangled symbol name
STT_FUNC STB_GLOBAL _Z9addKernelPiPKiS1_
If you use extern "C"
extern "C" __global__ void addKernel(int *c, const int *a, const int *b)
the symbol name will now be
STT_FUNC STB_GLOBAL addKernel
Using extern "C" will result in loss of function overloading and namespaces
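To see what a label such as _Z7Kernel1Pj actually encodes, the following host-side sketch demangles the symbol names quoted above. It assumes a GCC- or Clang-based toolchain, because abi::__cxa_demangle comes from the Itanium C++ ABI header <cxxabi.h> and is not available with MSVC. The helper name demangle is made up for this illustration. Under this mangling scheme the label is derived from the kernel's name and parameter types, so it should only change when the signature changes.

```cpp
#include <cassert>
#include <cstdlib>    // std::free
#include <cstring>    // std::strncmp
#include <cxxabi.h>   // abi::__cxa_demangle (GCC/Clang toolchains only)
#include <string>

// Hypothetical helper: returns the readable form of an Itanium-ABI symbol.
// Names that do not start with "_Z" are returned unchanged.
std::string demangle(const char* symbol) {
    assert(symbol != nullptr);
    if (std::strncmp(symbol, "_Z", 2) != 0) {
        return symbol;  // not Itanium-mangled, e.g. an extern "C" symbol
    }
    int status = 0;
    char* text = abi::__cxa_demangle(symbol, nullptr, nullptr, &status);
    if (status != 0 || text == nullptr) {
        return symbol;  // demangling failed, fall back to the raw name
    }
    std::string result(text);
    std::free(text);  // the buffer was allocated with malloc
    return result;
}
```

With this helper, demangle("_Z7Kernel1Pj") gives Kernel1(unsigned int*), which shows that the trailing Pj is simply the encoded "pointer to unsigned int" parameter.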


Why this redundancy in Swig interface file?

I see here that the SWIG interface file has identical declarations in two places (inside and after the %{ %} part).
/* example.i */
%module example
%{
/* Put header files here or function declarations like below */
extern double My_variable;
extern int fact(int n);
extern int my_mod(int x, int y);
extern char *get_time();
%}
extern double My_variable;
extern int fact(int n);
extern int my_mod(int x, int y);
extern char *get_time();
Why is this redundancy needed?
The code between %{ and %} is directly injected in the generated SWIG wrapper code to make the declarations available to the compiler. For example you could have a utility function in this block that's only visible to the compiler but not exposed to the target language, or add #include statements needed to compile the wrapper.
Code outside that block is parsed by SWIG and exposed to the target language. If you left out one of the declarations in this section it wouldn't be exposed to the target language.
You can eliminate the redundancy in this case by using %inline, which both injects the code into the wrapper and parses it for exposure to the target language:
%module example
%inline %{
extern double My_variable;
extern int fact(int n);
extern int my_mod(int x, int y);
extern char *get_time();
%}

nvcc warns about a device variable being a host variable - why?

I've been reading about template functions in the CUDA Programming Guide. Does something like this work?
#include <cstdio>
/* host struct */
template <typename T>
struct Test {
T *val;
int size;
};
/* struct device */
template <typename T>
__device__ Test<T> *d_test;
/* test function */
template <typename T>
T __device__ testfunc() {
return *d_test<T>->val;
}
/* test kernel */
__global__ void kernel() {
printf("funcout = %g \n", testfunc<float>());
}
I get the correct result but a warning:
"warning: a host variable "d_test [with T=T]" cannot be directly read in a device function" ?
Does the struct in the test function have to be instantiated with *d_test<float>->val?
KR,
Iggi
Unfortunately, the CUDA compiler seems to generally have some issues with variable templates. If you look at the assembly, you'll see that everything works just fine. The compiler clearly does instantiate the variable template and allocates a corresponding device object.
.global .align 8 .u64 _Z6d_testIfE;
The generated code uses this object just like it's supposed to
ld.global.u64 %rd3, [_Z6d_testIfE];
I'd consider this warning a compiler bug. Note that I cannot reproduce the issue with CUDA 10 here, so this issue has most likely been fixed by now. Consider updating your compiler…
@MichaelKenzel is correct.
This is almost certainly an nvcc bug, which I have now filed (you might need an account to access that).
Also note I've been able to reproduce the issue with less code:
template <typename T>
struct foo { int val; };
template <typename T>
__device__ foo<T> *x;
template <typename T>
int __device__ f() { return x<T>->val; }
__global__ void kernel() { int y = f<float>(); }
and have a look at the result on GodBolt as well.
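For comparison, here is a host-only sketch of the same variable template pattern with the __device__ and __global__ qualifiers removed, so it can be built with any C++14 compiler. It only illustrates that each instantiation of the variable template is a separate object, which matches the separate _Z6d_testIfE symbol seen in the PTX. It says nothing about how nvcc treats the device version.

```cpp
#include <cassert>

// Same struct as in the question.
template <typename T>
struct Test {
    T* val;
    int size;
};

// Host analogue of the __device__ variable template: one pointer object per T.
template <typename T>
Test<T>* d_test = nullptr;

// Host analogue of the device function; reads through the instance for T.
template <typename T>
T testfunc() {
    assert(d_test<T> != nullptr);  // the instance for T must be set first
    return *d_test<T>->val;
}
```

Setting d_test<float> leaves d_test<int> untouched, since the two are distinct variables.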

operator overloading in Cuda

I successfully created an operator+ between two float4 by doing :
__device__ float4 operator+(float4 a, float4 b) {
// ...
}
However, if I also want an operator+ for uchar4 and define it the same way, I get the following error:
"error: more than one instance of overloaded function "operator+" has "C" linkage"
I get a similar error message when I declare multiple functions with the same name but different arguments.
So, two questions:
Polymorphism: Is it possible to have multiple functions with the same name and different arguments in CUDA? If so, why do I get this error message?
operator+ for float4: it seems that this feature is already available by including "cutil_math.h", but when I include it (#include <cutil_math.h>) the compiler complains that there is no such file or directory. Is there anything particular I should do? Note: I am using PyCUDA, a Python wrapper for CUDA.
Thanks!
Note the "has "C" linkage" in the error. You are compiling your code with C linkage (pyCUDA does this by default to circumvent symbol mangling issues). C++ can't support multiple definitions of the same function name using C linkage.
The solution is to compile the code without the automatically generated extern "C" and to explicitly specify C linkage only for kernels. Your code would then look something like:
__device__ float4 operator+(float4 a, float4 b) { ... };
extern "C"
__global__ void kernel() { };
rather than the standard pyCUDA emitted:
extern "C"
{
__device__ float4 operator+(float4 a, float4 b) { ... };
__global__ void kernel() { };
}
pycuda.compiler.SourceModule has an option no_extern_c which can be used to control whether extern "C" is emitted by the just in time compilation system or not.
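The linkage rule itself can be tried out on the host without a GPU. The sketch below is an assumption-laden stand-in: Float4 and UChar4 are hypothetical replacements for CUDA's float4 and uchar4, and add_float4 is a made-up entry point playing the role of the kernel. In real CUDA code the operators would carry __device__ and the entry point __global__.

```cpp
#include <cassert>

// Hypothetical host-side stand-ins for CUDA's float4 and uchar4.
struct Float4 { float x, y, z, w; };
struct UChar4 { unsigned char x, y, z, w; };

// C++ linkage: both overloads can coexist because their mangled names differ.
Float4 operator+(Float4 a, Float4 b) {
    return {a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w};
}

UChar4 operator+(UChar4 a, UChar4 b) {
    // The sums are computed as int; the cast back wraps modulo 256.
    return {static_cast<unsigned char>(a.x + b.x),
            static_cast<unsigned char>(a.y + b.y),
            static_cast<unsigned char>(a.z + b.z),
            static_cast<unsigned char>(a.w + b.w)};
}

// C linkage only on the entry point that has to be found by its plain name.
extern "C" void add_float4(const Float4* a, const Float4* b, Float4* out) {
    assert(a != nullptr && b != nullptr && out != nullptr);
    *out = *a + *b;  // resolves to the Float4 overload
}
```

Moving both operator+ definitions inside an extern "C" { } block is what triggers the "more than one instance ... has "C" linkage" error, because C linkage leaves only one unmangled name for both.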

CUDA: How to apply __restrict__ on array of pointers to arrays?

This kernel using two __restrict__ int arrays compiles fine:
__global__ void kerFoo( int* __restrict__ arr0, int* __restrict__ arr1, int num )
{
for ( /* Iterate over array */ )
arr1[i] = arr0[i]; // Copy one to other
}
However, the same two int arrays composed into a pointer array fails compilation:
__global__ void kerFoo( int* __restrict__ arr[2], int num )
{
for ( /* Iterate over array */ )
arr[1][i] = arr[0][i]; // Copy one to other
}
The error given by the compiler is:
error: invalid use of `restrict'
I have certain structures that are composed as an array of pointers to arrays. (For example, a struct passed to the kernel that has int* arr[16].) How do I pass them to kernels and be able to apply __restrict__ on them?
The CUDA C manual only refers to the C99 definition of __restrict__, no special CUDA-specific circumstances.
Since the indicated parameter is an array containing two pointers, this use of __restrict__ looks perfectly valid to me, no reason for the compiler to complain IMHO. I would ask the compiler author to verify and possibly/probably correct the issue. I'd be interested in different opinions, though.
One remark to @talonmies:
The whole point of restrict is to tell the compiler that two or more pointer arguments will never overlap in memory.
This is not strictly true. restrict tells the compiler that the pointer in question, for the duration of its lifetime, is the only pointer through which the pointed-to object can be accessed. Be aware that the object pointed to is only assumed to be an array of int. (In truth it's only one int in this case.) Since the compiler cannot know the size of the array, it is up to the programmer to guard the array's boundaries.
Filling in the comment in your code with some arbitrary iteration, we get the following program:
__global__ void kerFoo( int* __restrict__ arr[2], int num )
{
for ( int i = 0; i < 1024; i ++)
arr[1][i] = arr[0][i]; // Copy one to other
}
and this compiles fine with CUDA 10.1 (Godbolt.org).
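For the struct case mentioned in the question (a struct holding int* arr[16]), one possible approach with older compilers is to copy the pointers into restrict-qualified locals inside the function. The sketch below is a host-side analogue, not a kernel: ArraySet and copy_first_to_second are hypothetical names, and __restrict__ is the GCC/Clang/nvcc spelling (MSVC uses __restrict). Whether the optimizer actually benefits from restrict on local pointers depends on the compiler, and the caller still has to guarantee that the two arrays do not overlap.

```cpp
#include <cassert>

// Hypothetical wrapper mirroring "a struct ... that has int* arr[16]".
struct ArraySet {
    int* arr[16];
};

// Host analogue of the kernel body: copies num ints from arr[0] to arr[1].
void copy_first_to_second(const ArraySet& set, int num) {
    assert(num >= 0);
    // Promise to the compiler: within this scope the data reached through
    // src and dst is not accessed through any other pointer.
    const int* __restrict__ src = set.arr[0];
    int* __restrict__ dst = set.arr[1];
    for (int i = 0; i < num; ++i) {
        dst[i] = src[i];
    }
}
```

In a kernel the struct would typically be passed by value instead of by reference, with the same local pointer declarations inside the body.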

C++: Explicit DLL Loading: First-chance Exception on non "extern C" functions

I am having trouble importing my C++ functions. If I declare them as C functions, I can import them successfully. With explicit loading, if any of the functions is missing the extern "C" decoration, I get the following exception:
First-chance exception at 0x00000000 in cpp.exe: 0xC0000005: Access violation.
DLL.h:
extern "C" __declspec(dllimport) int addC(int a, int b);
__declspec(dllimport) int addCpp(int a, int b);
DLL.cpp:
#include "DLL.h"
int addC(int a, int b) {
return a + b;
}
int addCpp(int a, int b) {
return a + b;
}
main.cpp:
#include "../DLL/DLL.h"
#include <stdio.h>
#include <windows.h>
int main() {
int a = 2;
int b = 1;
typedef int (*PFNaddC)(int,int);
typedef int (*PFNaddCpp)(int,int);
HMODULE hDLL = LoadLibrary(TEXT("../Debug/DLL.dll"));
if (hDLL != NULL)
{
PFNaddC pfnAddC = (PFNaddC)GetProcAddress(hDLL, "addC");
PFNaddCpp pfnAddCpp = (PFNaddCpp)GetProcAddress(hDLL, "addCpp");
printf("a=%d, b=%d\n", a,b);
printf("pfnAddC: %d\n", pfnAddC(a,b));
printf("pfnAddCpp: %d\n", pfnAddCpp(a,b)); //EXCEPTION ON THIS LINE
}
getchar();
return 0;
}
How can I import c++ functions for dynamic loading? I have found that the following code works with implicit loading by referencing the *.lib, but I would like to learn about dynamic loading.
Thank you to all in advance.
Update:
dumpbin /exports
1 00011109 ?addCpp@@YAHHH@Z = @ILT+260(?addCpp@@YAHHH@Z)
2 00011136 addC = @ILT+305(_addC)
Solution:
Create a conversion struct as found here.
Take a look at the file exports and explicitly copy the C++ mangled name.
PFNaddCpp pfnAddCpp = (PFNaddCpp)GetProcAddress(hDLL, "?addCpp@@YAHHH@Z");
Inevitably, the access violation on the null pointer is because GetProcAddress() returns null on error.
The problem is that C++ names are mangled by the compiler to accommodate a variety of C++ features (namespaces, classes, and overloading, among other things). So, your function addCpp() is not really named addCpp() in the resulting library. When you declare the function with extern "C", you give up overloading and the option of putting the function in a namespace, but in return you get a function whose name is not mangled, and which you can call from C code (which doesn't know anything about name mangling.)
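The failure mode can be reproduced without Windows. The sketch below is a portable simulation, not the Win32 API: kExports is a hypothetical stand-in for the DLL's export table, find_export plays the role of GetProcAddress, and call_export is a made-up wrapper that checks the returned pointer before calling it. The two export names are the MSVC-style names for the functions in the question.

```cpp
#include <cassert>
#include <cstring>

using AddFn = int (*)(int, int);

int add_impl(int a, int b) { return a + b; }

// Hypothetical stand-in for the export table of DLL.dll.
struct Export {
    const char* name;
    AddFn fn;
};

const Export kExports[] = {
    {"?addCpp@@YAHHH@Z", add_impl},  // C++ linkage: exported under the mangled name
    {"addC", add_impl},              // extern "C": exported under the plain name
};

// Stand-in for GetProcAddress: returns nullptr if the name is not exported.
AddFn find_export(const char* name) {
    for (const Export& e : kExports) {
        if (std::strcmp(e.name, name) == 0) {
            return e.fn;
        }
    }
    return nullptr;
}

// Calls the export only if the lookup succeeded; returns false otherwise.
bool call_export(const char* name, int a, int b, int* result) {
    assert(result != nullptr);
    AddFn fn = find_export(name);
    if (fn == nullptr) {
        return false;  // calling fn here would be the access violation at 0x00000000
    }
    *result = fn(a, b);
    return true;
}
```

Looking up "addCpp" fails in this model for the same reason as in the question: no export carries that exact name. The same null check applies to the pointers returned by the real GetProcAddress.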
One option to get around this is to export the functions using a .def file to rename the exported functions. There's an article, Explicitly Linking to Classes in DLLs, that describes what is necessary to do this.
It's possible to just wrap a whole header file in extern "C" as follows. Then you don't need to worry about forgetting an extern "C" on one of your declarations.
#ifdef __cplusplus
extern "C" {
#endif
__declspec(dllimport) int addC(int a, int b);
__declspec(dllimport) int addCpp(int a, int b);
#ifdef __cplusplus
} /* extern "C" */
#endif
You can still use all of the C++ features that you're used to in the function bodies -- these functions are still C++ functions -- they just have restrictions on the prototypes to make them compatible with C code.
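A minimal sketch of that last point, assuming a hypothetical function sum_first_n that is not part of the DLL in the question. The __declspec export decoration is left out so the example builds on any platform.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// The prototype uses only C-compatible types, so the symbol is not mangled.
extern "C" int sum_first_n(int n) {
    assert(n >= 0);
    // The body is ordinary C++: standard containers and algorithms work as usual.
    std::vector<int> values(static_cast<std::size_t>(n));
    std::iota(values.begin(), values.end(), 1);  // fills 1, 2, ..., n
    return std::accumulate(values.begin(), values.end(), 0);
}
```

The restriction applies to the signature only: a parameter of type std::vector<int> would not be usable from C, while using one inside the body is fine.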