Operator overloading in CUDA

I successfully created an operator+ between two float4 by doing:
__device__ float4 operator+(float4 a, float4 b) {
// ...
}
However, if in addition I want an operator+ for uchar4 and do the same thing with uchar4, I get the following error:
error: more than one instance of overloaded function "operator+" has "C" linkage
I get a similar error message when I declare multiple functions with the same name but different arguments.
So, two questions:
Polymorphism: Is it possible to have multiple functions with the same name but different arguments in CUDA? If so, why do I get this error message?
operator+ for float4: it seems this feature is already provided by "cutil_math.h", but when I include it (#include <cutil_math.h>) the compiler complains that there is no such file or directory... is there anything particular I need to do? Note: I am using PyCUDA (CUDA for Python).
Thanks!

Note the "has "C" linkage" part of the error. You are compiling your code with C linkage (PyCUDA does this by default to circumvent symbol-mangling issues). C++ cannot support multiple definitions of the same function name under C linkage.
The solution is to compile the code without the automatically generated extern "C" and to specify C linkage explicitly, and only for the kernels. Your code would then look something like:
__device__ float4 operator+(float4 a, float4 b) { ... };
extern "C"
__global__ void kernel() { };
rather than the standard PyCUDA-emitted:
extern "C"
{
__device__ float4 operator+(float4 a, float4 b) { ... };
__global__ void kernel() { };
}
pycuda.compiler.SourceModule has a no_extern_c option which controls whether the just-in-time compilation system emits the extern "C" wrapper or not.
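To make this concrete, here is a minimal sketch (identifiers are illustrative, not from the original post) of a source that should compile once the automatic wrapper is disabled: both overloads get C++ linkage, and only the kernel gets C linkage.
__device__ float4 operator+(float4 a, float4 b) {
    // componentwise addition; make_float4 is the CUDA built-in constructor helper
    return make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}
__device__ uchar4 operator+(uchar4 a, uchar4 b) {
    // the second overload can now coexist with the first
    return make_uchar4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}
// only the kernel needs C linkage, so PyCUDA can find it by its unmangled name
extern "C" __global__ void add_float4(const float4 *a, const float4 *b, float4 *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = a[i] + b[i];
}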

Related

cuda nvcc make __device__ conditional

I am trying to add a CUDA backend to a 20k-LOC C++ expression template library. So far it is working great, but I am drowning in completely bogus "warning: calling a __host__ function from a __host__ __device__ function is not allowed" warnings.
Most of the code can be summarized like this:
template<class Impl>
struct Wrapper{
Impl impl;
// lots and lots of decorator code
__host__ __device__ void call() { impl.call(); }
};
//Guaranteed to never ever be used on gpu.
struct ImplCPU{
void call();
};
//Guaranteed to never ever be used on cpu.
struct ImplGPU{
__host__ __device__ void call(); // actually only __device__, but needed to shut up the compiler as well
};
Wrapper<ImplCPU> wrapCPU;
Wrapper<ImplGPU> wrapGPU;
In all cases, call() in Wrapper is trivial, while the wrapper itself is a rather complicated beast (only host functions containing meta-information).
Conditional compilation is not an option; both paths are intended to be used side by side.
I am one step short of "--disable-warnings", because honestly the cost of copying and maintaining 10k LOC of horrible template magic outweighs the benefit of the warnings.
I would be very happy about a way to make call conditionally __device__ or __host__, based on whether the implementation targets the GPU or the CPU (Impl knows what it is for).
Just to show how bad it is, here is a single warning:
/home/user/Remora/include/remora/detail/matrix_expression_classes.hpp(859): warning: calling a __host__ function from a __host__ __device__ function is not allowed
detected during:
instantiation of "remora::matrix_matrix_prod<MatA, MatB>::size_type remora::matrix_matrix_prod<MatA, MatB>::size1() const [with MatA=remora::dense_triangular_proxy<const float, remora::row_major, remora::lower, remora::hip_tag>, MatB=remora::matrix<float, remora::column_major, remora::hip_tag>]"
/home/user/Remora/include/remora/cpu/../assignment.hpp(258): here
instantiation of "MatA &remora::assign(remora::matrix_expression<MatA, Device> &, const remora::matrix_expression<MatB, Device> &) [with MatA=remora::dense_matrix_adaptor<float, remora::row_major, remora::continuous_dense_tag, remora::hip_tag>, MatB=remora::matrix_matrix_prod<remora::dense_triangular_proxy<const float, remora::row_major, remora::lower, remora::hip_tag>, remora::matrix<float, remora::column_major, remora::hip_tag>>, Device=remora::hip_tag]"
/home/user/Remora/include/remora/cpu/../assignment.hpp(646): here
instantiation of "remora::noalias_proxy<C>::closure_type &remora::noalias_proxy<C>::operator=(const E &) [with C=remora::matrix<float, remora::row_major, remora::hip_tag>, E=remora::matrix_matrix_prod<remora::dense_triangular_proxy<const float, remora::row_major, remora::lower, remora::hip_tag>, remora::matrix<float, remora::column_major, remora::hip_tag>>]"
/home/user/Remora/Test/hip_triangular_prod.cpp(325): here
instantiation of "void Remora_hip_triangular_prod::triangular_prod_matrix_matrix_test(Orientation) [with Orientation=remora::row_major]"
/home/user/Remora/Test/hip_triangular_prod.cpp(527): here
This is actually a quite unfortunate deficiency in the CUDA language extensions.
The standard approach for dealing with these warnings (in Thrust and similar templated CUDA libraries) is to disable the warning for the function/method that causes it, using #pragma hd_warning_disable or, in newer CUDA (9.0 or later), #pragma nv_exec_check_disable.
So in your case it would be:
template<class Impl>
struct Wrapper{
Impl impl;
// lots and lots of decorator code
#pragma nv_exec_check_disable
__host__ __device__ void call() { impl.call(); }
};
A similar question has already been asked.
I'm sorry, but you're abusing the language and misleading readers. It is not true that your wrapper class has a __host__ __device__ method; what you mean to say is that it has either a __host__ method or a __device__ method. You should treat the warning more as an error.
So you can't just use the same template instantiation for ImplCPU and ImplGPU; but you could do something like this:
template<typename Impl> struct Wrapper;
template<> struct Wrapper<ImplGPU> {
    ImplGPU impl;
    __device__ void call() { impl.call(); }
};
template<> struct Wrapper<ImplCPU> {
    ImplCPU impl;
    __host__ void call() { impl.call(); }
};
or, if you want to be more pedantic:
enum implementation_device { CPU, GPU };
template<implementation_device ImplementationDevice> struct Wrapper;
template<> struct Wrapper<CPU> {
    __host__ void call();
};
template<> struct Wrapper<GPU> {
    __device__ void call();
};
Having said that, you were expecting to use a single Wrapper class, and here I am telling you that you can't. I suspect your question presents an X-Y problem, and you should really reconsider the entire approach of using that wrapper. Perhaps the code which uses it needs to be templated differently for CPU and GPU. Perhaps you need type erasure somewhere. But this won't do.
The solution I came up with in the meantime, with far less code duplication, is to push call down to a functor level:
template<class Impl, class Device>
struct WrapperImpl;
template<class Impl>
struct WrapperImpl<Impl, CPU> {
    typename Impl::Functor f;
    __host__ void operator()() { f(); }
};
// identical to the CPU version, except for __device__
template<class Impl>
struct WrapperImpl<Impl, GPU> {
    typename Impl::Functor f;
    __device__ void operator()() { f(); }
};
template<class Impl>
struct Wrapper {
    typedef WrapperImpl<Impl, typename Impl::Device> Functor;
    Impl impl;
    // lots and lots of decorator code that I now do not need to duplicate
    Functor call_functor() const {
        return Functor{impl.call_functor()};
    }
};
//repeat for around 20 classes
Wrapper<ImplCPU> wrapCPU;
wrapCPU.call_functor()();
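On the GPU side, a hypothetical usage sketch (the launch helper is my addition, assuming Impl::Device is the GPU tag and the functor is copyable by value): the functor is built on the host and invoked inside a kernel, where its __device__ operator() is legal.
template<class Functor>
__global__ void launch(Functor f) {
    f();  // invokes the __device__ operator() of WrapperImpl<Impl, GPU>
}

void run() {
    Wrapper<ImplGPU> wrapGPU;
    launch<<<1, 1>>>(wrapGPU.call_functor());
}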

How can I execute a host class function in a CUDA kernel

I have a genetic algorithm and I'm trying to evaluate a population of chromosomes on the GPU:
class Chromosome
{
    int fitness;
    int gene(int pos) { .... };
};
class Eval
{
public:
    __global__ void doEval(Chromosome *population)
    {
        ....
        int jobid = population[tid].gene(X);
        population[tid].fitness = Z;
        ....
    }
};
int main()
{
    Chromosome *dev_population;
    Eval eval;
    eval.doEval<<<1,N>>>(dev_population);
}
and I get these errors:
ga3.cu(121): warning: inline qualifier ignored for "global" function
ga3.cu(121): error: illegal combination of memory qualifiers
ga3.cu(323): error: a pointer to a bound function may only be used to call the function
ga3.cu(398): warning: nested comment is not allowed
Where are the problems?
I removed the Eval class, leaving only the doEval function, and made gene() __device__ __host__, like this:
__device__ __host__ int gene(int pos)
{ .... };
__global__ void doEval(Chromosome *population)
{
....
int jobid = population[tid].gene(X);
population[tid].fitness = Z;
....
}
int main()
{
Chromosome *dev_population;
doEval<<<1,N>>>(dev_population);
}
but now I have other errors, and it does not compile:
/usr/include/c++/4.6/iomanip(66): error: expected an expression
/usr/include/c++/4.6/iomanip(96): error: expected an expression
/usr/include/c++/4.6/iomanip(127): error: expected an expression
/usr/include/c++/4.6/iomanip(195): error: expected an expression
/usr/include/c++/4.6/iomanip(225): error: expected an expression
5 errors detected in the compilation of "/tmp/tmpxft_00006fe9_00000000-4_ga3.cpp1.ii".
There are two problems here, one soluble, the other not.
It is illegal in CUDA for a __global__ function (i.e. a kernel) to be defined as a class member function, so doEval can never be defined as a member of Eval. You are free to launch a kernel from inside a structure or class member function, but a kernel cannot itself be a member function. You will have to redesign this class; there is no workaround.
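For instance, a minimal sketch of such a redesign (the names and launch configuration are illustrative): the kernel moves to namespace scope and the class member merely launches it.
// a kernel at namespace scope is legal
__global__ void doEvalKernel(Chromosome *population)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // ... evaluate population[tid] as in the original doEval body ...
}

class Eval
{
public:
    // launching a kernel from a member function is allowed
    void doEval(Chromosome *population, int n)
    {
        doEvalKernel<<<1, n>>>(population);
    }
};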
Any function called from device code must be explicitly denoted as a device function and be instantiated and compiled for the device. This applies both to regular functions and to class member functions. nvcc treats all functions as host functions unless they are identified otherwise. You can therefore fix this error by doing something like the following:
class Chromosome
{
int fitness;
__device__ __host__ int gene(int pos) { .... };
};
Note that every function called by gene must also have a valid device definition for the code to successfully compile.
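For example, with a hypothetical helper decode (not in the original code), the callee needs the same qualifiers:
class Chromosome
{
    int fitness;
    // the helper must be device-capable too, or nvcc will reject the call from gene
    __device__ __host__ int decode(int raw) { return raw & 0xff; }
    __device__ __host__ int gene(int pos) { return decode(pos); }
};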

Writing a simple thrust functor operating on some zipped arrays

I am trying to perform a thrust::reduce_by_key using zip and permutation iterators, i.e. operating on a zipped array of several 'virtual' permuted arrays.
I am having trouble writing the syntax for the functor density_update.
But first, the setup of the problem.
Here is my function call:
thrust::reduce_by_key( dflagt,
dflagtend,
thrust::make_zip_iterator(
thrust::make_tuple(
thrust::make_permutation_iterator(dmasst, dmapt),
thrust::make_permutation_iterator(dvelt, dmapt),
thrust::make_permutation_iterator(dmasst, dflagt),
thrust::make_permutation_iterator(dvelt, dflagt)
)
),
thrust::make_discard_iterator(),
danswert,
thrust::equal_to<int>(),
density_update()
);
dmapt and dflagt are of type thrust::device_ptr<int>, and dvelt, dmasst and danswert are of type thrust::device_ptr<double>.
(They are thrust wrappers around my raw CUDA arrays.)
The arrays dmapt and dflagt are both index vectors from which I need to perform a gather operation on the arrays dmasst and dvelt.
After the reduction step I intend to write my data to the danswert array. Since multiple arrays are being used in the reduction, I am obviously using zip iterators.
My problem lies in writing the functor density_update, which is a binary operation.
struct density_update
{
typedef thrust::device_ptr<double> ElementIterator;
typedef thrust::device_ptr<int> IndexIterator;
typedef thrust::permutation_iterator<ElementIterator,IndexIterator> PIt;
typedef thrust::tuple< PIt , PIt , PIt, PIt> Tuple;
__host__ __device__
double operator()(const Tuple& x , const Tuple& y)
{
return thrust::get<0>(*x) * (thrust::get<1>(*x) - thrust::get<3>(*x)) + \
thrust::get<0>(*y) * (thrust::get<1>(*y) - thrust::get<3>(*y));
}
};
The value being returned is a double. Why the binary operation looks the way it does is not important here; I just want to know how to correct the above syntactically.
As shown above, the code throws a number of compilation errors, and I am not sure where I have gone wrong.
I am using CUDA 4.0 on GTX 570 on Ubuntu 10.10
density_update should not receive tuples of iterators as parameters -- it needs tuples of the iterators' references.
In principle you could write density_update::operator() in terms of the particular reference types of the various iterators, but it's simpler to let the compiler infer the types of the parameters:
struct density_update
{
template<typename Tuple>
__host__ __device__
double operator()(const Tuple& x, const Tuple& y)
{
return thrust::get<0>(x) * (thrust::get<1>(x) - thrust::get<3>(x)) + \
thrust::get<0>(y) * (thrust::get<1>(y) - thrust::get<3>(y));
}
};
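As a self-contained illustration of the pattern (this sketch is my own, with made-up data, and it uses thrust::transform rather than the original reduce_by_key call), a templated operator() compiles against whatever tuple type the zip iterator produces:
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

struct weighted_diff
{
    template<typename Tuple>
    __host__ __device__
    double operator()(const Tuple& t) const
    {
        // mass * (velocity - reference velocity), element by element
        return thrust::get<0>(t) * (thrust::get<1>(t) - thrust::get<2>(t));
    }
};

int main()
{
    thrust::device_vector<double> mass(4, 2.0), vel(4, 3.0), ref(4, 1.0), out(4);
    thrust::transform(
        thrust::make_zip_iterator(thrust::make_tuple(mass.begin(), vel.begin(), ref.begin())),
        thrust::make_zip_iterator(thrust::make_tuple(mass.end(), vel.end(), ref.end())),
        out.begin(),
        weighted_diff());
    return 0;  // out now holds 2.0 * (3.0 - 1.0) == 4.0 in every element
}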

Cannot overload make_uint4 function

I'm trying to overload make_uint4 in the following manner:
namespace A {
namespace B {
inline __host__ __device__ uint4 make_uint4(uint2 a, uint2 b) {
return make_uint4(a.x, a.y, b.x, b.y);
}
}
}
But when I try to compile it, nvcc returns an error:
error: no suitable constructor exists to convert from "unsigned int" to "uint2"
error: no suitable constructor exists to convert from "unsigned int" to "uint2"
error: too many arguments in function call
All these errors point to the "return…" line.
I was able to get a partial repro on VS 2010 and CUDA 4.0 (the compiler built the code OK but Intellisense flagged the error you are seeing). Try the following:
#include "vector_functions.h"
inline __host__ __device__ uint4 make_uint4(uint2 a, uint2 b)
{
return ::make_uint4(a.x, a.y, b.x, b.y);
}
This fixed it for me.
I have no problem compiling it in Visual Studio + nvcc. What compiler are you using?
If it is of any help: make_uint4 is defined in vector_functions.h, line 170, as
static __inline__ __host__ __device__ uint4 make_uint4(unsigned int x, unsigned int y, unsigned int z, unsigned int w)
{
uint4 t; t.x = x; t.y = y; t.z = z; t.w = w; return t;
}
Update:
I get a similar error when I try to overload the function while inside a custom namespace of my own. Are you certain you are not inside one? If so, try putting :: in front of the function call to refer to the global scope, i.e.:
return ::make_uint4(a.x, a.y, b.x, b.y);
I don't have the library code, but it seems the compiler doesn't like overloaded device functions (as if they were treated like really fancy inline macros). What it does is shadow (hide) the old make_uint4(a, b, c, d) with your new make_uint4(va, vb) and then try to call the latter with four uint arguments. That doesn't work because there is no conversion from uint to uint2 (as indicated by the first two error messages) and because there are four instead of two arguments (the last error message).
Use a slightly different function name like make_uint4_from_uint2s and you'll be fine.
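For instance, a minimal sketch of the renamed helper (the name is the answer's suggestion; the body mirrors the question's code):
inline __host__ __device__ uint4 make_uint4_from_uint2s(uint2 a, uint2 b)
{
    // a distinct name cannot shadow the built-in; :: makes the intent explicit anyway
    return ::make_uint4(a.x, a.y, b.x, b.y);
}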

C++: Explicit DLL Loading: First-chance Exception on non "extern C" functions

I am having trouble importing my C++ functions. If I declare them as C functions, I can import them successfully. When loading explicitly, if any of the functions is missing the extern "C" decoration, I get the following exception:
First-chance exception at 0x00000000 in cpp.exe: 0xC0000005: Access violation.
DLL.h:
extern "C" __declspec(dllimport) int addC(int a, int b);
__declspec(dllimport) int addCpp(int a, int b);
DLL.cpp:
#include "DLL.h"
int addC(int a, int b) {
return a + b;
}
int addCpp(int a, int b) {
return a + b;
}
main.cpp:
#include "..DLL/DLL.h"
#include <stdio.h>
#include <windows.h>
int main() {
int a = 2;
int b = 1;
typedef int (*PFNaddC)(int,int);
typedef int (*PFNaddCpp)(int,int);
HMODULE hDLL = LoadLibrary(TEXT("../Debug/DLL.dll"));
if (hDLL != NULL)
{
PFNaddC pfnAddC = (PFNaddC)GetProcAddress(hDLL, "addC");
PFNaddCpp pfnAddCpp = (PFNaddCpp)GetProcAddress(hDLL, "addCpp");
printf("a=%d, b=%d\n", a,b);
printf("pfnAddC: %d\n", pfnAddC(a,b));
printf("pfnAddCpp: %d\n", pfnAddCpp(a,b)); //EXCEPTION ON THIS LINE
}
getchar();
return 0;
}
How can I import C++ functions for dynamic loading? I have found that the code above works with implicit loading by referencing the *.lib, but I would like to learn about dynamic loading.
Thank you all in advance.
Update:
dumpbin /exports:
1 00011109 ?addCpp@@YAHHH@Z = @ILT+260(?addCpp@@YAHHH@Z)
2 00011136 addC = @ILT+305(_addC)
Solution:
Create a conversion struct as found here.
Take a look at the file exports and copy the C++ mangled name explicitly:
PFNaddCpp pfnAddCpp = (PFNaddCpp)GetProcAddress(hDLL, "?addCpp@@YAHHH@Z");
Inevitably, the access violation on the null pointer is because GetProcAddress() returns null on error.
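As a defensive sketch (my addition, not from the original post), checking the result of the lookup before calling through it turns the crash into a diagnosable failure:
PFNaddCpp pfnAddCpp = (PFNaddCpp)GetProcAddress(hDLL, "addCpp");
if (pfnAddCpp == NULL)
{
    // GetProcAddress returns NULL when the symbol is not found in the DLL
    printf("GetProcAddress failed, error %lu\n", GetLastError());
    return 1;
}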
The problem is that C++ names are mangled by the compiler to accommodate a variety of C++ features (namespaces, classes, and overloading, among other things). So your function addCpp() is not really named addCpp() in the resulting library. When you declare the function with extern "C", you give up overloading and the option of putting the function in a namespace, but in return you get a function whose name is not mangled and which you can call from C code (which doesn't know anything about name mangling).
One option to get around this is to export the functions using a .def file to rename the exported functions. There's an article, Explicitly Linking to Classes in DLLs, that describes what is necessary to do this.
It's possible to just wrap a whole header file in extern "C" as follows. Then you don't need to worry about forgetting an extern "C" on one of your declarations.
#ifdef __cplusplus
extern "C" {
#endif
__declspec(dllimport) int addC(int a, int b);
__declspec(dllimport) int addCpp(int a, int b);
#ifdef __cplusplus
} /* extern "C" */
#endif
You can still use all of the C++ features that you're used to in the function bodies -- these functions are still C++ functions -- they just have restrictions on the prototypes to make them compatible with C code.
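For completeness, a sketch of the exporting side (my assumption of a typical setup; the original post does not show how the DLL project exports its symbols): the same guard is combined with dllexport when building the DLL, usually switched by a project-defined macro.
// DLL.h, shared between the DLL and its consumers
// DLL_EXPORTS is a hypothetical macro defined only in the DLL project settings
#ifdef DLL_EXPORTS
#define DLL_API __declspec(dllexport)
#else
#define DLL_API __declspec(dllimport)
#endif

#ifdef __cplusplus
extern "C" {
#endif
DLL_API int addC(int a, int b);
DLL_API int addCpp(int a, int b);
#ifdef __cplusplus
} /* extern "C" */
#endif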