Passing an object with virtual functions to a CUDA kernel [duplicate]

It seems that CUDA does not allow me to "pass an object of a class derived from virtual base classes to __global__ function", for some reason related to the "virtual table" or "virtual pointer".
I wonder whether there is some way for me to set up the "virtual pointer" manually, so that I can use polymorphism?

Is There Any Way To Copy vtable From Host To Device
You wouldn't want to copy the vtable from host to device. The vtable in an object created on the host contains a set of host function pointers. When you copy such an object to the device, the vtable doesn't get changed or "fixed up", so you end up with an object on the device whose vtable is full of host pointers.
If you then try to call one of those virtual functions (using the object on the device, from device code), bad things happen: the function entry points listed in the vtable are addresses that make no sense in device code.
so that I can use the polymorphism
My recommendation for a way to use polymorphism in device code is to create the object on the device. This sets up the vtable with a set of device function pointers, rather than host function pointers, and questions such as this demonstrate that it works. To a first order approximation, if you have a way to create a set of polymorphic objects in host code, I don't know of any reason why you shouldn't be able to use a similar method in device code. The issue really has to do with interoperability - moving such objects between host and device - which is what the stated limitations in the programming guide are referring to.
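For instance, here is a minimal sketch (the Shape/Circle types and the kernel are illustrative, not taken from the linked question) of constructing and using a polymorphic object entirely in device code:

#include <cstdio>

struct Shape {
    __device__ virtual double area() const { return 0.0; }
    __device__ virtual ~Shape() {}
};

struct Circle : public Shape {
    double r;
    __device__ Circle(double r_) : r(r_) {}
    __device__ double area() const override { return 3.14159265358979 * r * r; }
};

__global__ void createAndUse()
{
    Shape *s = new Circle(2.0);        // constructed in device code: vtable holds device pointers
    printf("area = %f\n", s->area());  // virtual dispatch works here
    delete s;                          // destructor also resolved via the device vtable
}

Launching createAndUse<<<1,1>>>(); followed by cudaDeviceSynchronize(); should print the area, because the objects never cross the host/device boundary.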
I wonder whether there is some way for me to set up the "virtual pointer" manually
There might be. In the interest of sharing knowledge, I will outline a method. However, I don't know C++ well enough to say whether this is acceptable/legal. The only thing I can say is that, in my very limited testing, it appears to work. But I would assume it is not legal, and so I do not recommend you use this method for anything other than experimentation. Even if we don't resolve whether or not it is legal, there is already a stated CUDA limitation (as indicated above) that you should not attempt to pass objects with virtual functions between host and device. So I offer it merely as an observation, which may be interesting for experimentation or research. I don't suggest it for production code.
The basic idea is outlined in this thread. It is predicated on the idea that an ordinary object copy does not copy the virtual function table pointer (the constructor sets it according to where the construction happens), but that the raw bytes of the object as a whole do contain it. Therefore, if we use a method like this:
template<typename T>
__device__ void fixVirtualPointers(T *other) {
    T temp = T(*other);              // object-copy moves the "guts" of the object w/o changing vtable
    memcpy(other, &temp, sizeof(T)); // byte-wise copy carries the vtable pointer along
}
it seems to be possible to take a given object, create a new "dummy" object of that type on the device, and then "fix up" the vtable by doing a byte-wise copy of the entire object (via memcpy, considering the full object size) rather than a "typical" object copy. Use this at your own risk. This blog may also be interesting reading, although I can't vouch for the correctness of any statements there.
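As a hedged illustration of how such a fix-up might be driven from the host side (the Derived type and the wrapper kernel below are hypothetical, not from the question):

// Hypothetical sketch: copy a host-constructed object to the device, then
// rebuild its vtable pointer in device code before any virtual call is made.
__global__ void fixupKernel(Derived *d) { fixVirtualPointers(d); }

Derived h_obj;                        // host object: vtable pointer refers to host code
Derived *d_obj;
cudaMalloc(&d_obj, sizeof(Derived));
cudaMemcpy(d_obj, &h_obj, sizeof(Derived), cudaMemcpyHostToDevice);
fixupKernel<<<1,1>>>(d_obj);          // after this, virtual calls on d_obj in device code should resolve to device functions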
Beyond this, there are a variety of other suggestions here on the cuda tag, you may wish to review them.

I would like to provide a different way to fix the vtable which does not rely on copying the vtable between objects. The idea is to use placement new on the device to let the compiler generate the appropriate vtable. However, this approach also violates the restrictions stated in the programming guide.
#include <cstdio>

struct A{
    __host__ __device__
    virtual void foo(){
        printf("A\n");
    }
};

struct B : public A{
    B(int i = 13) : data(i){}

    __host__ __device__
    virtual void foo() override{
        printf("B %d\n", data);
    }

    int data;
};

// Reconstruct the object in place with placement new so the compiler
// generates a device-side vtable pointer for it.
template<class T>
__global__
void fixKernel(T* ptr){
    T tmp(*ptr);
    new (ptr) T(tmp);
}

__global__
void useKernel(A* ptr){
    ptr->foo();
}

int main(){
    // host-side objects: vtables point at host code
    A a;
    a.foo();

    B b(7);
    b.foo();

    A* ab = new B();
    ab->foo();

    // copy the objects to the device; the copied vtable pointers are still host addresses
    A* d_a;
    cudaMalloc(&d_a, sizeof(A));
    cudaMemcpy(d_a, &a, sizeof(A), cudaMemcpyHostToDevice);

    B* d_b;
    cudaMalloc(&d_b, sizeof(B));
    cudaMemcpy(d_b, &b, sizeof(B), cudaMemcpyHostToDevice);

    // fix up the vtables on the device, then call the virtual functions there
    fixKernel<<<1,1>>>(d_a);
    useKernel<<<1,1>>>(d_a);

    fixKernel<<<1,1>>>(d_b);
    useKernel<<<1,1>>>(d_b);

    cudaDeviceSynchronize();

    cudaFree(d_b);
    cudaFree(d_a);
    delete ab;
}

Related

How to reuse code for CPU fallback in CUDA

I have some calculations that I want to parallelize if my user has a CUDA-compliant GPU, otherwise I want to execute the same code on the CPU. I don't want to have two versions of the algorithm code, one for CPU and one for GPU to maintain. I'm considering the following approach but am wondering if the extra level of indirection will hurt performance or if there is a better practice.
For my test, I took the basic CUDA template that adds the elements of two integer arrays and stores the result in a third array. I removed the actual addition operation and placed it into its own function marked with both device and host directives...
__device__ __host__ void addSingleItem(int* c, const int* a, const int* b)
{
    *c = *a + *b;
}
... then modified the kernel to call the aforementioned function on the element identified by threadIdx...
__global__ void addKernel(int* c, const int* a, const int* b)
{
    const unsigned i = threadIdx.x;
    addSingleItem(c + i, a + i, b + i);
}
So now my application can check for the presence of a CUDA device. If one is found I can use...
addKernel <<<1, size>>> (dev_c, dev_a, dev_b);
... and if not I can forego parallelization and iterate through the elements calling the host version of the function...
int* pA = (int*)a;
int* pB = (int*)b;
int* pC = (int*)c;

for (int i = 0; i < arraySize; i++)
{
    addSingleItem(pC++, pA++, pB++);
}
Everything seems to work in my small test app, but I'm concerned about the extra call involved. Do device-to-device function calls incur any significant performance hits? Is there a more generally accepted way to do CPU fallback that I should adopt?
If addSingleItem and addKernel are defined in the same translation unit/module/file, there should be no cost to having a device-to-device function call. The compiler will aggressively inline that code, as if you wrote it in a single function.
That is undoubtedly the best approach if it can be managed, for the reason described above.
If it's desired to still have some file-level modularity, it is possible to break code into a separate file and include that file in the compilation of the kernel function. Conceptually this is no different than what is described already.
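For instance (file names here are illustrative), the shared function can live in a header that gets pulled into the kernel's translation unit:

// add_single_item.cuh  (illustrative file name)
__device__ __host__ inline void addSingleItem(int* c, const int* a, const int* b)
{
    *c = *a + *b;
}

// kernels.cu
#include "add_single_item.cuh"

__global__ void addKernel(int* c, const int* a, const int* b)
{
    const unsigned i = threadIdx.x;
    addSingleItem(c + i, a + i, b + i);
}

Because the definition is still visible in the kernel's translation unit, the compiler can inline it just as before.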
Another possible approach is to use compiler macros to assist in the addition or removal or modification of code to handle the GPU case vs. non-GPU case. There are endless possibilities here, but see here for a simple idea. You can redefine what __host__ __device__ means in different scenarios, for example. I would say this probably only makes sense if you are building separate binaries for the GPU vs. non-GPU case, but you may find a clever way to handle it in the same executable.
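One common sketch of the macro idea (the macro name here is arbitrary): the CUDA qualifiers are emitted only when the file is compiled by nvcc, so the same source also builds with a plain host compiler.

// Expands to the CUDA qualifiers only when nvcc is doing the compilation.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

HOST_DEVICE void addSingleItem(int* c, const int* a, const int* b)
{
    *c = *a + *b;
}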
Finally, if you desire this but must place the __device__ function in a separate translation unit, it is still possible but there may be some performance loss due to the device-to-device function call across module boundaries. The amount of performance loss here is hard to generalize since it depends heavily on code structure, but it's not unusual to see 10% or 20% performance hit. In that case, you may wish to investigate link-time-optimizations that became available in CUDA 11.
This question may also be of interest, although only tangentially related here.

Why can't a CPU thread take the address of a CUDA __device__ function?

To pass a device function pointer, we have to
typedef int (*fp_t)(int);

__device__ int dev_fn(int x)
{
    // do something
    return x;
}

__device__ fp_t dev_fp = dev_fn; // [1]

void host_fn()
{
    fp_t fp;
    cudaMemcpyFromSymbol(&fp, (void const*)&dev_fp, sizeof(fp));
    // now you can pass fp to cuda functions
}
which is quite convoluted as we have to get the address of dev_fn and store it to another symbol as at [1] above.
Why doesn't a __device__ function act as a symbol itself, so that we can use cudaGetSymbolAddress on the host side to get the address of dev_fn directly, instead of going through an intermediate symbol dev_fp?
For whatever reason, CUDA made the design choice of allowing CPU threads to take the address of __device__ variables, probably because there had to be a way to copy data into and out of them. It would make no sense to attempt to copy data to the address of a __device__ function.
It's possible something similar could be engineered to take the address of __device__ functions, but it's unlikely the solution could be made to work well with __host__ __device__ functions. There's no single address that makes sense for __host__ __device__ functions.
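As a sketch building on the question's code above (the kernel name is made up), the pointer fetched with cudaMemcpyFromSymbol can be passed to a kernel and called there:

#include <cstdio>

// fp_t, dev_fn and dev_fp as declared in the question above
__global__ void callThrough(fp_t f, int x)
{
    printf("%d\n", f(x));   // indirect call to dev_fn, resolved in device code
}

// inside host_fn(), after cudaMemcpyFromSymbol has filled in fp:
// callThrough<<<1, 1>>>(fp, 42);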

Thrust device_malloc and device_new

What are the advantages for using Thrust device_malloc instead of the normal cudaMalloc and what does device_new do?
For device_malloc it seems the only reason to use it is that it's just a bit cleaner.
The device_new documentation says:
"device_new implements the placement new operator for types resident
in device memory. device_new calls T's null constructor on a array of
objects in device memory. No memory is allocated by this function."
Which I don't understand...
device_malloc returns the proper type of object if you plan on using Thrust for other things. There is normally no reason to use cudaMalloc if you are using Thrust. Encapsulating CUDA calls makes it easier and usually cleaner. The same thing goes for C++ and STL containers versus C-style arrays and malloc.
For device_new, you should read the following line of the documentation:
template<typename T>
device_ptr<T> thrust::device_new(device_ptr<void> p, const size_t n = 1)
p: A device_ptr to a region of device memory into which to construct one or many Ts.
Basically, this function can be used if memory has already been allocated. Only the default constructor will be called, and it will return a device_ptr cast to T's type.
On the other hand, the following method allocates memory and returns a device_ptr<T>:
template<typename T>
device_ptr<T> thrust::device_new(const size_t n = 1)
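As a small sketch of combining the two (the Particle type is illustrative): allocate raw device memory first, then construct objects into it with device_new.

#include <thrust/device_malloc.h>
#include <thrust/device_new.h>
#include <thrust/device_free.h>

struct Particle { int id; };   // illustrative type

int main()
{
    // allocate raw device memory, then default-construct 4 Particles into it
    thrust::device_ptr<void>     raw = thrust::device_malloc(4 * sizeof(Particle));
    thrust::device_ptr<Particle> p   = thrust::device_new<Particle>(raw, 4);

    // thrust::raw_pointer_cast(p) could now be handed to a kernel

    thrust::device_free(raw);
    return 0;
}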
So I think I have found one good use for device_new.
It's basically a cleaner way of initialising an object and copying it to the device, while holding a pointer to it on the host.
So instead of doing:
Particle *dev_p;
cudaMalloc((void**)&(dev_p), sizeof(Particle));
cudaMemcpy(dev_p, &p, sizeof(Particle), cudaMemcpyHostToDevice);
test2<<<1,1>>>(dev_p);
I can just do:
thrust::device_ptr<Particle> p = thrust::device_new<Particle>(1);
test2<<<1,1>>>(thrust::raw_pointer_cast(p));

Is this way of allocating a device object "correct"?

So I asked a question before about how to allocate an object on the device directly, instead of the "normal" way:
Allocate on host
Copy to device
Copy dynamically allocated fields to device one by one
The main reason I want it to be allocated directly on the device is that I don't want to copy each dynamically allocated field inside one by one manually.
Anyway, so I think I have actually found a way to do this, and I would like to see some input from more experienced CUDA programmers (like Robert Crovella).
Let's see the code first:
#include <cstdio>

class Particle
{
public:
    int *data;

    __device__ Particle()
    {
        data = new int[10];
        for (int i=0; i<10; i++)
        {
            data[i] = i*2;
        }
    }
};

__global__ void test(Particle **result)
{
    Particle *p = new Particle();
    result[0] = p; // store memory location
}

__global__ void test2(Particle *p)
{
    for (int i=0; i<10; i++)
        printf("%d\n", p->data[i]);
}

int main() {
    // initialise and allocate an object on device
    Particle **d_p_addr;
    cudaMalloc((void**)&d_p_addr, sizeof(Particle*));
    test<<<1,1>>>(d_p_addr);

    // copy pointer to host memory
    Particle **p_addr = new Particle*[1];
    cudaMemcpy(p_addr, d_p_addr, sizeof(Particle*), cudaMemcpyDeviceToHost);

    // test:
    test2<<<1,1>>>(p_addr[0]);
    cudaDeviceSynchronize();

    printf("Done!\n");
}
As you can see, what I do is:
Call a kernel that initialises an object on the device and stores its pointer in an output parameter
Copy the pointer to the allocated object from device memory to host memory
Now you can pass that pointer to another kernel just fine !
This code actually works, but I'm not sure if there are drawbacks.
Cheers
EDIT: as pointed out by Robert, there was no point in creating a pointer on the host first, so I removed that part from the code.
Yes, you can do that.
You are allocating an object on the device, and passing a pointer to it from one kernel to the next. Since a characteristic of device malloc/new is that allocations persist for the lifetime of the context (not just the kernel), the allocations do not disappear at the end of the kernel. This is basically standard C++ behavior, but I thought it might be worth repeating. The pointer(s) that you are passing from one kernel to the next are therefore valid in any subsequent device code in the context of your program.
There is a wrinkle you might want to be aware of, however. Pointers returned by dynamic allocations done on the device (such as via new or malloc in device code) are not usable for transferring data from device to host, at least in the present incarnation of CUDA (CUDA 5.0 and earlier). The reasons for this are somewhat arcane (translation: I can't adequately explain it), but it's instructive to think about the fact that dynamic allocations come out of the device heap, a region that is logically separate from the region of global memory that runtime API functions like cudaMalloc and cudaMemcpy use. An oblique indication of this is given here:
Memory reserved for the device heap is in addition to memory allocated through host-side CUDA API calls such as cudaMalloc().
If you want to prove this wrinkle to yourself, try adding the following seemingly innocuous code after your second kernel call:
Particle *q;
q = (Particle *)malloc(sizeof(Particle));
cudaMemcpy(q, p_addr[0], sizeof(Particle), cudaMemcpyDeviceToHost);
If you then check the API error value returned from that cudaMemcpy operation, you will observe the error.
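For instance, capturing the return status makes the failure visible (a minimal sketch continuing the snippet above):

cudaError_t err = cudaMemcpy(q, p_addr[0], sizeof(Particle), cudaMemcpyDeviceToHost);
printf("cudaMemcpy: %s\n", cudaGetErrorString(err));  // prints the error string instead of silently ignoring it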
As an unrelated comment, your use of the pointer *p is a little freaky, in my book, and the compiler warning given about it is an indication of the weirdness. It's not technically illegal, since you're not actually doing anything meaningful with that pointer (you immediately replace it in your kernel 1), but nevertheless it's weird because you're passing a pointer to a kernel that you haven't properly cudaMalloc'ed. In the context of what you're demonstrating, it's completely unnecessary, and your first parameter to kernel 1 could be eliminated and replaced with a local variable, eliminating the weirdness and the compiler warning.

Cuda host object to device

I am trying to replicate a big class on my CUDA device that contains a lot of variables and methods. I have put the class definition into a .cuh file and I am able to create objects and use them in my device code.
The question now is: is there any way of getting an already existing object from host to device? I am still using a serial version of my code to read in some geometry and physical data. If it is possible to copy it over to the device without using an intermediate array or so, how does the device handle its size without using sizeof?
Do I use something like this then for the allocation ?
MyClass *MyObject;        // existing host object
MyClass *MyObject_device;
int size = sizeof(MyClass);
cudaMalloc((void**)&MyObject_device, size);
cudaMemcpy(MyObject_device, MyObject, size, cudaMemcpyHostToDevice);
Any advice would be very much appreciated.
The CUDA compiler is designed to match the data structure alignment and packing that is used in the host compiler. So you can safely pass an object between device and host and access the members regardless of their alignment requirements.
You can pass objects directly as kernel parameters. For instance:
Host:
MyKernel<<<grid_dim, block_dim>>>(my_object);
Device:
__global__ void MyKernel(MyObject my_object) {
If you need to pass an array of objects, an easy way is to use thrust::device_vector. For instance:
Host:
#include <thrust/device_vector.h>
thrust::device_vector<MyObject> my_objects;
...
MyObject* my_objects_d = thrust::raw_pointer_cast(&my_objects[0]);
MyKernel<<<grid_dim, block_dim>>>(my_objects_d);
Device:
__global__ void MyKernel(MyObject* my_objects) {