cuda nvcc: make __device__ conditional

I am trying to add a CUDA backend to a 20k LOC C++ expression template library. So far it is working great, but I am drowning in completely bogus "warning: calling a __host__ function from a __host__ __device__ function is not allowed" warnings.
Most of the code can be summarized like this:
template<class Impl>
struct Wrapper{
    Impl impl;
    // lots and lots of decorator code
    __host__ __device__ void call(){ impl.call(); }
};
//Guaranteed to never ever be used on gpu.
struct ImplCPU{
    void call();
};
//Guaranteed to never ever be used on cpu.
struct ImplGPU{
    __host__ __device__ void call(); //Actually only __device__, but needed to shut up the compiler as well
};
Wrapper<ImplCPU> wrapCPU;
Wrapper<ImplGPU> wrapGPU;
In all cases, call() in Wrapper is trivial, while the wrapper itself is a rather complicated beast (containing only host functions with meta-information).
Conditional compilation is not an option; both paths are intended to be used side by side.
I am one step short of "--disable-warnings", because honestly the cost of copying and maintaining 10k LOC of horrible template magic outweighs the benefits of warnings.
I would be super happy about a way to make call() __device__ or __host__ conditionally, based on whether the implementation is for GPU or CPU (because Impl knows what it is for).
Just to show how bad it is, here is a single warning:
/home/user/Remora/include/remora/detail/matrix_expression_classes.hpp(859): warning: calling a __host__ function from a __host__ __device__ function is not allowed
detected during:
instantiation of "remora::matrix_matrix_prod<MatA, MatB>::size_type remora::matrix_matrix_prod<MatA, MatB>::size1() const [with MatA=remora::dense_triangular_proxy<const float, remora::row_major, remora::lower, remora::hip_tag>, MatB=remora::matrix<float, remora::column_major, remora::hip_tag>]"
/home/user/Remora/include/remora/cpu/../assignment.hpp(258): here
instantiation of "MatA &remora::assign(remora::matrix_expression<MatA, Device> &, const remora::matrix_expression<MatB, Device> &) [with MatA=remora::dense_matrix_adaptor<float, remora::row_major, remora::continuous_dense_tag, remora::hip_tag>, MatB=remora::matrix_matrix_prod<remora::dense_triangular_proxy<const float, remora::row_major, remora::lower, remora::hip_tag>, remora::matrix<float, remora::column_major, remora::hip_tag>>, Device=remora::hip_tag]"
/home/user/Remora/include/remora/cpu/../assignment.hpp(646): here
instantiation of "remora::noalias_proxy<C>::closure_type &remora::noalias_proxy<C>::operator=(const E &) [with C=remora::matrix<float, remora::row_major, remora::hip_tag>, E=remora::matrix_matrix_prod<remora::dense_triangular_proxy<const float, remora::row_major, remora::lower, remora::hip_tag>, remora::matrix<float, remora::column_major, remora::hip_tag>>]"
/home/user/Remora/Test/hip_triangular_prod.cpp(325): here
instantiation of "void Remora_hip_triangular_prod::triangular_prod_matrix_matrix_test(Orientation) [with Orientation=remora::row_major]"
/home/user/Remora/Test/hip_triangular_prod.cpp(527): here

This problem is actually a quite unfortunate deficiency in the CUDA language extensions.
The standard approach for dealing with these warnings (used in Thrust and similar templated CUDA libraries) is to disable the warning for the function/method that causes it by using #pragma hd_warning_disable, or, in newer CUDA versions (9.0 or later), #pragma nv_exec_check_disable.
So in your case it would be:
template<class Impl>
struct Wrapper{
    Impl impl;
    // lots and lots of decorator code
    #pragma nv_exec_check_disable
    __host__ __device__ void call(){ impl.call(); }
};
A similar question has already been asked.

I'm sorry, but you're abusing the language and misleading readers. It is not true that your wrapper class has a __host__ __device__ method; what you mean to say is that it has either a __host__ method or a __device__ method. You should treat the warning more as an error.
So you can't just use the same template instantiation for ImplCPU and ImplGPU, but you could do something like this:
template<typename Impl> struct Wrapper;

template<> struct Wrapper<ImplGPU> {
    ImplGPU impl;
    __device__ void call(){ impl.call(); }
};

template<> struct Wrapper<ImplCPU> {
    ImplCPU impl;
    __host__ void call(){ impl.call(); }
};
or if you want to be more pedantic, it could be:
enum implementation_device { CPU, GPU };

template<implementation_device ImplementationDevice> struct Wrapper;

template<> struct Wrapper<CPU> {
    __host__ void call();
};

template<> struct Wrapper<GPU> {
    __device__ void call();
};
Having said that - you were expecting to use a single Wrapper class, and here I am telling you that you can't do that. I suspect your question presents an X-Y problem, and you should really reconsider the entire approach of using that wrapper. Perhaps you need the code which uses it templated differently for CPU and GPU. Perhaps you need type erasure somewhere. But this won't do.

The solution I came up with in the meantime, with far less code duplication, is to replace call() with a functor level:
template<class Impl, class Device>
struct WrapperImpl;

template<class Impl>
struct WrapperImpl<Impl, CPU>{
    typename Impl::Functor f;
    __host__ void operator()(){ f(); }
};

//identical to CPU up to __device__
template<class Impl>
struct WrapperImpl<Impl, GPU>{
    typename Impl::Functor f;
    __device__ void operator()(){ f(); }
};

template<class Impl>
struct Wrapper{
    typedef WrapperImpl<Impl, typename Impl::Device> Functor;
    Impl impl;
    // lots and lots of decorator code that i now do not need to duplicate
    Functor call_functor() const{
        return Functor{impl.call_functor()};
    }
};
//repeat for around 20 classes
Wrapper<ImplCPU> wrapCPU;
wrapCPU.call_functor()();

Related

Exposing a non-static member function of a class to ChaiScript

I have a project that tries to implement keyboard macro scripting with ChaiScript. I am writing a class based on Xlib to wrap the Xlib code.
I have a member function that adds a modifier key to an ignored list, because of an Xlib quirk.
How could I do something like the following minimal example?
#include <chaiscript/chaiscript.hpp>
#include <functional>

class MacroEngine{
public:
    MacroEngine() = default;
    //...
    void addIgnoredMod(int modifier){
        ignoredMods |= modifier;
    }
    //...
private:
    int ignoredMods = 0;
};

int main(int argc, char *argv[]){
    MacroEngine me;
    chaiscript::ChaiScript chai;
    //...
    chai.add(chaiscript::fun(std::bind(&MacroEngine::addIgnoredMod, me, std::placeholders::_1)), "setIgnoredMods");
    //...
    return 0;
}
I tried bind and failed with the following error message:
In file included from ../deps/ChaiScript/include/chaiscript/dispatchkit/proxy_functions_detail.hpp:24:0,
from ../deps/ChaiScript/include/chaiscript/dispatchkit/proxy_functions.hpp:27,
from ../deps/ChaiScript/include/chaiscript/dispatchkit/proxy_constructors.hpp:14,
from ../deps/ChaiScript/include/chaiscript/dispatchkit/dispatchkit.hpp:34,
from ../deps/ChaiScript/include/chaiscript/chaiscript_basic.hpp:12,
from ../deps/ChaiScript/include/chaiscript/chaiscript.hpp:823,
from ../src/main.cpp:2:
../deps/ChaiScript/include/chaiscript/dispatchkit/callable_traits.hpp: In instantiation of ‘struct chaiscript::dispatch::detail::Callable_Traits<std::_Bind<void (MacroEngine::*(MacroEngine, std::_Placeholder<1>))(unsigned int)> >’:
../deps/ChaiScript/include/chaiscript/language/../dispatchkit/register_function.hpp:45:72: required from ‘chaiscript::Proxy_Function chaiscript::fun(const T&) [with T = std::_Bind<void (MacroEngine::*(MacroEngine, std::_Placeholder<1>))(unsigned int)>; chaiscript::Proxy_Function = std::shared_ptr<chaiscript::dispatch::Proxy_Function_Base>]’
../src/main.cpp:21:95: required from here
../deps/ChaiScript/include/chaiscript/dispatchkit/callable_traits.hpp:99:84: error: decltype cannot resolve address of overloaded function
typedef typename Function_Signature<decltype(&T::operator())>::Signature Signature;
^~~~~~~~~
../deps/ChaiScript/include/chaiscript/dispatchkit/callable_traits.hpp:100:86: error: decltype cannot resolve address of overloaded function
typedef typename Function_Signature<decltype(&T::operator())>::Return_Type Return_Type;
^~~~~~~~~~~
I also tried making the variable static, which worked, but that won't do if I want to make it possible to ignore modifiers on a per-hotkey basis.
What am I doing wrong, and how can I fix it?
You can do this instead:
chai.add(chaiscript::fun(&MacroEngine::addIgnoredMod, &me), "addIgnoredMod");
Or use a lambda:
chai.add(chaiscript::fun([&me](int modifier){ me.addIgnoredMod(modifier); }), "addIgnoredMod");
Jason Turner, the creator of ChaiScript, commented on it here: http://discourse.chaiscript.com/t/may-i-use-std-bind/244/4
"There’s really never any good reason to use std::bind. A much better solution is to use a lambda (and by much better, I mean much much better. std::bind adds to compile size, compile time and runtime)."

Can I use vararg functions in CUDA device-side code?

I know that we can't write CUDA kernels with a variable number of parameters:
Is it possible to have a CUDA kernel with varying number of parameters?
(at least not in the C varargs sense; we can use C++ variadic templates.)
But what about non-kernel device-side code, i.e. __device__ functions? Can these be varargs functions?
Yes, we can write varargs device-side functions.
For example:
#include <stdio.h>
#include <stdarg.h>

__device__ void foo(const char* str, ...)
{
    va_list ap;
    va_start(ap, str);
    int arg = va_arg(ap, int); // making an assumption here
    printf("str is \"%s\", first va_list argument is %d\n", str, arg);
    va_end(ap);
}
This compiles fine with NVCC - and works, provided you actually pass a null-terminated string and an int. I would not be surprised if CUDA's printf() itself were implemented this way.

nvcc warns about a device variable being a host variable - why?

I've been reading about template functions in the CUDA Programming Guide; is something like this supposed to work?
#include <cstdio>

/* host struct */
template <typename T>
struct Test {
    T *val;
    int size;
};

/* device struct */
template <typename T>
__device__ Test<T> *d_test;

/* test function */
template <typename T>
T __device__ testfunc() {
    return *d_test<T>->val;
}

/* test kernel */
__global__ void kernel() {
    printf("funcout = %g \n", testfunc<float>());
}
I get the correct result, but also a warning:
warning: a host variable "d_test [with T=T]" cannot be directly read in a device function
Does the struct in the test function have to be instantiated as *d_test<float>->val?
KR,
Iggi
Unfortunately, the CUDA compiler seems to generally have some issues with variable templates. If you look at the assembly, you'll see that everything works just fine. The compiler clearly does instantiate the variable template and allocates a corresponding device object:
.global .align 8 .u64 _Z6d_testIfE;
The generated code uses this object just like it's supposed to:
ld.global.u64 %rd3, [_Z6d_testIfE];
I'd consider this warning a compiler bug. Note that I cannot reproduce the issue with CUDA 10 here, so this issue has most likely been fixed by now. Consider updating your compiler…
@MichaelKenzel is correct.
This is almost certainly an nvcc bug - which I have now filed (you might need an account to access that).
Also note I've been able to reproduce the issue with less code:
template <typename T>
struct foo { int val; };

template <typename T>
__device__ foo<T> *x;

template <typename T>
int __device__ f() { return x<T>->val; }

__global__ void kernel() { int y = f<float>(); }
and have a look at the result on GodBolt as well.

Is there a shorthand for __host__ and __device__ in CUDA?

I have many functions in my CUDA C program with both the __host__ and __device__ modifiers. I'm looking for a shorthand for these two, to make my code look neater by changing
__host__ __device__ void foo() {
}
to
__both__ void foo() {
}
Of course I can
#define __both__ __host__ __device__
but if something like this already existed, I would prefer to use the existing solution.
No, there is currently no shorthand for this in CUDA C/C++.

operator overloading in Cuda

I successfully created an operator+ between two float4s by doing:
__device__ float4 operator+(float4 a, float4 b) {
// ...
}
However, if in addition I want to have an operator+ for uchar4, doing the same thing with uchar4 gives the following error:
error: more than one instance of overloaded function "operator+" has "C" linkage
I get a similar error message when I declare multiple functions with the same name but different arguments.
So, two questions:
1. Polymorphism: Is it possible to have multiple functions with the same name and different arguments in CUDA? If so, why do I get this error message?
2. operator+ for float4: it seems this feature is already provided by "cutil_math.h", but when I include that header (#include <cutil_math.h>) the compiler complains that there is no such file or directory... Is there anything particular I should do? Note: I am using PyCUDA, which is CUDA for Python.
Thanks!
Note the "has "C" linkage" part of the error. You are compiling your code with C linkage (PyCUDA does this by default to circumvent symbol-mangling issues). C++ cannot support multiple definitions of the same function name with C linkage.
The solution is to compile the code without the automatically generated extern "C", and to explicitly specify C linkage only for kernels. Your code would then look something like:
__device__ float4 operator+(float4 a, float4 b) { ... };
extern "C"
__global__ void kernel() { };
rather than the standard pyCUDA emitted:
extern "C"
{
    __device__ float4 operator+(float4 a, float4 b) { ... };
    __global__ void kernel() { };
}
pycuda.compiler.SourceModule has an option no_extern_c which can be used to control whether extern "C" is emitted by the just-in-time compilation system or not.