nvcc warns about a device variable being a host variable - why?

I've been reading about template functions in the CUDA Programming Guide, and I'm wondering whether something like this is supposed to work:
#include <cstdio>

/* host struct */
template <typename T>
struct Test {
    T *val;
    int size;
};

/* device struct */
template <typename T>
__device__ Test<T> *d_test;

/* test function */
template <typename T>
__device__ T testfunc() {
    return *d_test<T>->val;
}

/* test kernel */
__global__ void kernel() {
    printf("funcout = %g \n", testfunc<float>());
}
I get the correct result but a warning:
"warning: a host variable "d_test [with T=T]" cannot be directly read in a device function" ?
Does the struct in the test function have to be instantiated as *d_test<float>->val?

Unfortunately, the CUDA compiler seems to generally have some issues with variable templates. If you look at the assembly, you'll see that everything works just fine. The compiler clearly does instantiate the variable template and allocates a corresponding device object.
.global .align 8 .u64 _Z6d_testIfE;
The generated code uses this object just like it's supposed to:
ld.global.u64 %rd3, [_Z6d_testIfE];
I'd consider this warning a compiler bug. Note that I cannot reproduce the issue with CUDA 10 here, so this issue has most likely been fixed by now. Consider updating your compiler…
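If you want to check this on your own setup, you can ask nvcc to emit the PTX and look for the instantiated symbol; a minimal sketch, assuming the snippet above is saved as test.cu (file names are placeholders):
# Emit PTX instead of building a binary.
nvcc -ptx test.cu -o test.ptx
# The variable template instance appears under its mangled name,
# e.g. _Z6d_testIfE for d_test<float>.
grep d_test test.ptx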

@MichaelKenzel is correct.
This is almost certainly an nvcc bug - which I have now filed (you might need an account to access that).
Also note I've been able to reproduce the issue with less code:
template <typename T>
struct foo { int val; };

template <typename T>
__device__ foo<T> *x;

template <typename T>
__device__ int f() { return x<T>->val; }

__global__ void kernel() { int y = f<float>(); }
and have a look at the result on GodBolt as well.

Related

cuda nvcc make __device__ conditional

I am trying to add a CUDA backend to a 20k LOC C++ expression template library. So far it is working great, but I am drowning in completely bogus "warning: calling a __host__ function from a __host__ __device__ function is not allowed" warnings.
Most of the code can be summarized like this:
template<class Impl>
struct Wrapper{
    Impl impl;
    // lots and lots of decorator code
    __host__ __device__ void call(){ impl.call(); }
};

// Guaranteed to never ever be used on gpu.
struct ImplCPU{
    void call();
};

// Guaranteed to never ever be used on cpu.
struct ImplGPU{
    __host__ __device__ void call(); // Actually only __device__, but needed to shut up the compiler as well
};

Wrapper<ImplCPU> wrapCPU;
Wrapper<ImplGPU> wrapGPU;
In all cases, call() in Wrapper is trivial, while the wrapper itself is a rather complicated beast (only host functions containing meta-information).
Conditional compilation is not an option; both paths are intended to be used side by side.
I am one step short of "--disable-warnings", because honestly the cost of copying and maintaining 10k LOC of horrible template magic outweighs the benefits of warnings.
I would be super happy about a way to make call conditionally __device__ or __host__ based on whether the implementation is for GPU or CPU (because Impl knows what it is for).
Just to show how bad it is, here is a single warning:
/home/user/Remora/include/remora/detail/matrix_expression_classes.hpp(859): warning: calling a __host__ function from a __host__ __device__ function is not allowed
detected during:
instantiation of "remora::matrix_matrix_prod<MatA, MatB>::size_type remora::matrix_matrix_prod<MatA, MatB>::size1() const [with MatA=remora::dense_triangular_proxy<const float, remora::row_major, remora::lower, remora::hip_tag>, MatB=remora::matrix<float, remora::column_major, remora::hip_tag>]"
/home/user/Remora/include/remora/cpu/../assignment.hpp(258): here
instantiation of "MatA &remora::assign(remora::matrix_expression<MatA, Device> &, const remora::matrix_expression<MatB, Device> &) [with MatA=remora::dense_matrix_adaptor<float, remora::row_major, remora::continuous_dense_tag, remora::hip_tag>, MatB=remora::matrix_matrix_prod<remora::dense_triangular_proxy<const float, remora::row_major, remora::lower, remora::hip_tag>, remora::matrix<float, remora::column_major, remora::hip_tag>>, Device=remora::hip_tag]"
/home/user/Remora/include/remora/cpu/../assignment.hpp(646): here
instantiation of "remora::noalias_proxy<C>::closure_type &remora::noalias_proxy<C>::operator=(const E &) [with C=remora::matrix<float, remora::row_major, remora::hip_tag>, E=remora::matrix_matrix_prod<remora::dense_triangular_proxy<const float, remora::row_major, remora::lower, remora::hip_tag>, remora::matrix<float, remora::column_major, remora::hip_tag>>]"
/home/user/Remora/Test/hip_triangular_prod.cpp(325): here
instantiation of "void Remora_hip_triangular_prod::triangular_prod_matrix_matrix_test(Orientation) [with Orientation=remora::row_major]"
/home/user/Remora/Test/hip_triangular_prod.cpp(527): here
This problem is actually a quite unfortunate deficiency in the CUDA language extensions.
The standard approach to dealing with these warnings (used in Thrust and similar templated CUDA libraries) is to disable the warning for the function/method that causes it by using #pragma hd_warning_disable, or, in newer CUDA (9.0 or newer), #pragma nv_exec_check_disable.
So in your case it would be:
template<class Impl>
struct Wrapper{
    Impl impl;
    // lots and lots of decorator code
#pragma nv_exec_check_disable
    __host__ __device__ void call(){ impl.call(); }
};
A similar question has already been asked.
I'm sorry, but you're abusing the language and misleading readers. It is not true that your wrapper classes have a __host__ __device__ method; what you mean to say is that each has either a __host__ method or a __device__ method. You should treat the warning as more of an error.
So, you can't just use the same template instantiation for ImplCPU and ImplGPU; but you could do something like this:
template<typename Impl> struct Wrapper;

template<> struct Wrapper<ImplGPU> {
    ImplGPU impl;
    __device__ void call(){ impl.call(); }
};

template<> struct Wrapper<ImplCPU> {
    ImplCPU impl;
    __host__ void call(){ impl.call(); }
};
or if you want to be more pedantic, it could be:
enum implementation_device { CPU, GPU };

template<implementation_device ImplementationDevice> struct Wrapper;

template<> struct Wrapper<CPU> {
    __host__ void call();
};

template<> struct Wrapper<GPU> {
    __device__ void call();
};
Having said that - you were expecting to use a single Wrapper class, and here I am telling you that you can't do that. I suspect your question presents an X-Y problem, and you should really reconsider the entire approach of using that wrapper. Perhaps you need the code which uses it to be templated differently for CPU and GPU. Perhaps you need type erasure somewhere. But this won't do.
The solution I came up with in the meantime, with far less code duplication, is to replace call with a functor level:
template<class Impl, class Device>
struct WrapperImpl;

template<class Impl>
struct WrapperImpl<Impl, CPU>{
    typename Impl::Functor f;
    __host__ void operator()(){ f(); }
};

// identical to CPU up to __device__
template<class Impl>
struct WrapperImpl<Impl, GPU>{
    typename Impl::Functor f;
    __device__ void operator()(){ f(); }
};

template<class Impl>
struct Wrapper{
    typedef WrapperImpl<Impl, typename Impl::Device> Functor;
    Impl impl;
    // lots and lots of decorator code that I now do not need to duplicate
    Functor call_functor() const {
        return Functor{impl.call_functor()};
    }
};
//repeat for around 20 classes
Wrapper<ImplCPU> wrapCPU;
wrapCPU.call_functor()();
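For completeness, here is a minimal sketch of what an Impl compatible with this scheme might look like; the CPU/GPU tag types and the member names are my assumptions, not part of the original code:
struct CPU {};
struct GPU {};

struct ImplCPU {
    // Only this tiny functor carries an execution-space qualifier;
    // the rest of the class stays host-only.
    struct Functor {
        __host__ void operator()() { /* actual CPU work */ }
    };
    typedef CPU Device; // tells Wrapper which WrapperImpl specialization to pick
    Functor call_functor() const { return Functor{}; }
};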

Why NVCC is unable to match function definition to an existing declaration?

I have these files that produce the following error when compiling with nvcc:
error C2244: 'TemplateClass<N>::print': unable to match function definition to an existing declaration
note: see declaration of 'TemplateClass<N>::print'
note: definition
note: 'void TemplateClass<N>::print(const std::string [])'
note: existing declarations
note: 'void TemplateClass<N>::print(const std::string [N])'
Template.h
#pragma once
#include <string>
#include <iostream>

template <unsigned int N>
class TemplateClass
{
private:
    std::string name;
public:
    TemplateClass();
    TemplateClass(const std::string& name);
    void print(const std::string familyName[N]);
};

#include "template.inl"
Template.inl
template <unsigned int N>
TemplateClass<N>::TemplateClass()
{
    name = "Unknown";
}

template <unsigned int N>
TemplateClass<N>::TemplateClass(const std::string& name)
{
    this->name = name;
}

template <unsigned int N>
void TemplateClass<N>::print(const std::string familyName[N])
{
    std::cout << "My name is " << name << " ";
    for (auto i = 0; i < N; i++)
        std::cout << familyName[i] << " ";
    std::cout << std::endl;
}
consume_template.cu
#include "template.h"
void consume_template_gpu()
{
TemplateClass<3> obj("aname");
std::string namesf[3];
namesf[0] = "un";
namesf[1] = "deux";
namesf[2] = "trois";
obj.print(namesf);
}
I am using VS2017 15.4.5; later versions failed to create the project with CMake.
The project was created with CMake like this
cmake_minimum_required(VERSION 3.10)
project(template_inl_file LANGUAGES CXX CUDA)
set (lib_files template.h consume_template.cu)
add_library(template_inl_file_lib ${lib_files})
Just out of curiosity, try using std::string namesf[] = {"un","deux","trois"};. It seems like a compiler issue; trying different forms might help the compiler understand better. Otherwise the code seems to be OK.
Maybe you're missing some linkage with CMake. Also try compiling straight from VS2017 without using CMake by creating a CUDA project.
What's happening is that the array is decaying into a pointer and the size of the array is lost during compilation. So this
template <unsigned int N>
void TemplateClass<N>::print(const std::string familyName[N]);
will be actually turned into this
template <unsigned int N>
void TemplateClass<N>::print(const std::string* familyName);
As we can see, there is no way for the compiler to know that it has to generate different functions depending on the size of the array (i.e. the template parameter N).
To solve this, we can use an old trick to avoid array decay:
template <unsigned int N>
void TemplateClass<N>::print(const std::string (&familyName)[N]);
Now the size N is present throughout compilation, and the compiler knows there are different functions to be generated. I guess, as pointed out in the comments on the question, that NVCC produces code that VS would not produce by itself, and VS then does not know how to handle it.
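To see the difference in isolation, here is a minimal standalone sketch (separate from the question's class) showing that the reference form keeps N deducible while the decayed form does not:
#include <string>
#include <iostream>

// Decays: the parameter is really const std::string*, so N plays no role
// in the parameter type.
template <unsigned int N>
void print_decayed(const std::string familyName[N]);

// No decay: reference-to-array keeps N in the type, so each N
// instantiates a distinct function.
template <unsigned int N>
void print_by_ref(const std::string (&familyName)[N])
{
    for (unsigned int i = 0; i < N; i++)
        std::cout << familyName[i] << " ";
    std::cout << std::endl;
}

int main()
{
    std::string names[3] = {"un", "deux", "trois"};
    print_by_ref(names); // N deduced as 3
    return 0;
}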
More info on the topic on the following links
http://pointer-overloading.blogspot.ch/2013/09/c-template-argument-deduction-to-deduce.html
http://en.cppreference.com/w/cpp/language/template_argument_deduction
https://theotherbranch.wordpress.com/2011/08/24/template-parameter-deduction-from-array-dimensions/

Is there a shorthand for __host__ and __device__ in CUDA?

I have many functions in my CUDA C program with both the __host__ and __device__ modifiers. I'm looking for a shorthand for these two to make my code look neat, by changing
__host__ __device__ void foo() {
}
to
__both__ void foo() {
}
Of course I could
#define __both__ __host__ __device__
but if something like this already existed, I would prefer to use the existing solution.
No, there is no shorthand for this in CUDA C/C++ currently.
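If you do go the #define route, it is worth guarding the macro so the same header also compiles with a plain host compiler, and picking a name without leading double underscores, since those identifiers are reserved for the implementation. A minimal sketch, where CUDA_BOTH is an arbitrary name of my choosing:
// __CUDACC__ is defined whenever nvcc drives the compilation,
// so the macro expands to nothing in a plain host build.
#ifdef __CUDACC__
#define CUDA_BOTH __host__ __device__
#else
#define CUDA_BOTH
#endif

CUDA_BOTH void foo() {
}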

How can I use templates or typedefs to select between float/double vector types?

A usual way to target different floating-point precisions (float/double) is either by typedefs
typedef float Real;
//typedef double Real;
or by using templates
template<typename Real>
...
This is convenient, but does anyone have ideas on how to handle the CUDA types float2/float3/... and make_float2/make_float3/...? Sure, I could write #defines or typedefs for all of them, but that seems not very elegant.
You can implement a helper class that maps an element type and a channel count to the corresponding vector type:
template <typename T, int cn> struct MakeVec;

template <> struct MakeVec<float, 3>
{
    typedef float3 type;
};

template <> struct MakeVec<double, 3>
{
    typedef double3 type;
};
// and so on for all combinations of T and cn
Usage:
template <typename T>
void func()
{
    typedef typename MakeVec<T, 4>::type vec4_type;
    vec4_type vec4; // for T=float it will be float4, for T=double it will be double4
}
You can find an implementation here.
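The same idea extends to the make_* factory functions; here is a minimal sketch for the 3-channel case (make_vec3 is a name I made up, not a CUDA API; float3/make_float3 and friends come from the CUDA headers that nvcc includes for .cu files):
// Dispatches to the right CUDA factory based on the element type.
template <typename T>
__host__ __device__
typename MakeVec<T, 3>::type make_vec3(T x, T y, T z);

template <>
__host__ __device__
float3 make_vec3<float>(float x, float y, float z)
{
    return make_float3(x, y, z);
}

template <>
__host__ __device__
double3 make_vec3<double>(double x, double y, double z)
{
    return make_double3(x, y, z);
}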

C++: Explicit DLL Loading: First-chance Exception on non "extern C" functions

I am having trouble importing my C++ functions. If I declare them as C functions, I can successfully import them. When loading explicitly, if any of the functions is missing the extern "C" decoration, I get the following exception:
First-chance exception at 0x00000000 in cpp.exe: 0xC0000005: Access violation.
DLL.h:
extern "C" __declspec(dllimport) int addC(int a, int b);
__declspec(dllimport) int addCpp(int a, int b);
DLL.cpp:
#include "DLL.h"
int addC(int a, int b) {
    return a + b;
}

int addCpp(int a, int b) {
    return a + b;
}
main.cpp:
#include "..DLL/DLL.h"
#include <stdio.h>
#include <windows.h>
int main() {
int a = 2;
int b = 1;
typedef int (*PFNaddC)(int,int);
typedef int (*PFNaddCpp)(int,int);
HMODULE hDLL = LoadLibrary(TEXT("../Debug/DLL.dll"));
if (hDLL != NULL)
{
PFNaddC pfnAddC = (PFNaddC)GetProcAddress(hDLL, "addC");
PFNaddCpp pfnAddCpp = (PFNaddCpp)GetProcAddress(hDLL, "addCpp");
printf("a=%d, b=%d\n", a,b);
printf("pfnAddC: %d\n", pfnAddC(a,b));
printf("pfnAddCpp: %d\n", pfnAddCpp(a,b)); //EXCEPTION ON THIS LINE
}
getchar();
return 0;
}
How can I import C++ functions for dynamic loading? I have found that this code works with implicit loading by referencing the *.lib, but I would like to learn about dynamic loading.
Update:
dumpbin /exports
1    00011109 ?addCpp@@YAHHH@Z = @ILT+260(?addCpp@@YAHHH@Z)
2    00011136 addC = @ILT+305(_addC)
Solution:
1. Create a conversion struct as found here.
2. Take a look at the file exports and copy the C++ mangled name explicitly:
PFNaddCpp pfnAddCpp = (PFNaddCpp)GetProcAddress(hDLL, "?addCpp@@YAHHH@Z");
Inevitably, the access violation on the null pointer is because GetProcAddress() returns null on error.
The problem is that C++ names are mangled by the compiler to accommodate a variety of C++ features (namespaces, classes, and overloading, among other things). So, your function addCpp() is not really named addCpp() in the resulting library. When you declare the function with extern "C", you give up overloading and the option of putting the function in a namespace, but in return you get a function whose name is not mangled, and which you can call from C code (which doesn't know anything about name mangling.)
One option to get around this is to export the functions using a .def file to rename the exported functions. There's an article, Explicitly Linking to Classes in DLLs, that describes what is necessary to do this.
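For illustration, here is a minimal sketch of such a .def file, using the mangled name from the dumpbin output above to export addCpp under its plain name (the LIBRARY name is my assumption, matching the project above):
; Module-definition file for the DLL project.
; The alias maps the undecorated name to the mangled C++ symbol.
LIBRARY DLL
EXPORTS
    addC
    addCpp = ?addCpp@@YAHHH@Z
With this in place, GetProcAddress(hDLL, "addCpp") resolves without spelling out the mangled name.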
It's possible to just wrap a whole header file in extern "C" as follows. Then you don't need to worry about forgetting an extern "C" on one of your declarations.
#ifdef __cplusplus
extern "C" {
#endif
__declspec(dllimport) int addC(int a, int b);
__declspec(dllimport) int addCpp(int a, int b);
#ifdef __cplusplus
} /* extern "C" */
#endif
You can still use all of the C++ features that you're used to in the function bodies -- these functions are still C++ functions -- they just have restrictions on the prototypes to make them compatible with C code.