I am trying to replicate a big class on my CUDA device that contains a lot of variables and methods. I have put the class definition into a .cuh file and I am able to create objects and use them in my device code.
The question now is: is there any way of getting an already existing object from host to device? I am still using a serial version of my code to read in some geometry and physical data. If it is possible to copy it over to the device without going through an intermediate array or the like, how does the device handle the object's size without using sizeof?
Do I use something like this then for the allocation?
MyClass MyObject;            // already existing host object
MyClass *MyObject_device;
int size = sizeof(MyClass);
cudaMalloc((void**)&MyObject_device, size);
cudaMemcpy(MyObject_device, &MyObject, size, cudaMemcpyHostToDevice);
Any advice would be very much appreciated.
The CUDA compiler is designed to match the data structure alignment and packing that is used in the host compiler. So you can safely pass an object between device and host and access the members regardless of their alignment requirements.
You can pass objects directly as kernel parameters. For instance:
Host:
MyKernel<<<grid_dim, block_dim>>>(my_object);
Device:
__global__ void MyKernel(MyObject my_object) {
If you need to pass an array of objects, an easy way is to use thrust::device_vector. For instance:
Host:
#include <thrust/device_vector.h>
thrust::device_vector<MyObject> my_objects;
...
MyObject* my_objects_d = thrust::raw_pointer_cast(&my_objects[0]);
MyKernel<<<grid_dim, block_dim>>>(my_objects_d);
Device:
__global__ void MyKernel(MyObject* my_objects) {
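For completeness, here is a minimal, self-contained sketch of that pattern. MyObject is a hypothetical stand-in struct, since the real class is not shown:
#include <thrust/device_vector.h>
#include <vector>
#include <cstdio>

// Hypothetical stand-in for the real class
struct MyObject {
    int id;
    float value;
};

__global__ void MyKernel(MyObject* my_objects, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        printf("object %d: %f\n", my_objects[i].id, my_objects[i].value);
}

int main() {
    // build the objects on the host, then let thrust copy them to the device
    std::vector<MyObject> host_objects = { {0, 1.5f}, {1, 2.5f}, {2, 3.5f} };
    thrust::device_vector<MyObject> my_objects(host_objects.begin(), host_objects.end());

    MyObject* my_objects_d = thrust::raw_pointer_cast(my_objects.data());
    MyKernel<<<1, (int)my_objects.size()>>>(my_objects_d, (int)my_objects.size());
    cudaDeviceSynchronize();
    return 0;
}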
Related
It seems that CUDA does not allow me to "pass an object of a class derived from virtual base classes to __global__ function", for some reason related to the "virtual table" or "virtual pointer".
I wonder is there some way for me to setup the "virtual pointer" manually, so that I can use the polymorphism?
Is There Any Way To Copy vtable From Host To Device
You wouldn't want to copy the vtable from host to device. The vtable on the host (i.e. in an object created on the host) has a set of host function pointers in the vtable. When you copy such an object to the device, the vtable doesn't get changed or "fixed up", and so you end up with an object on the device, whose vtable is full of host pointers.
If you then try and call one of those virtual functions (using the object on the device, from device code), bad things happen. The numerical function entry points listed in the vtable are addresses that don't make any sense in device code.
so that I can use the polymorphism
My recommendation for a way to use polymorphism in device code is to create the object on the device. This sets up the vtable with a set of device function pointers, rather than host function pointers, and questions such as this demonstrate that it works. To a first order approximation, if you have a way to create a set of polymorphic objects in host code, I don't know of any reason why you shouldn't be able to use a similar method in device code. The issue really has to do with interoperability - moving such objects between host and device - which is what the stated limitations in the programming guide are referring to.
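To illustrate the idea, here is a minimal sketch that constructs a polymorphic object entirely in device code and calls a virtual function on it. Shape and Square are made-up types, and device-side new/delete requires compute capability 2.0 or later:
#include <cstdio>

struct Shape {
    __device__ virtual float area() const { return 0.0f; }
    __device__ virtual ~Shape() {}
};

struct Square : public Shape {
    float side;
    __device__ Square(float s) : side(s) {}
    __device__ float area() const override { return side * side; }
};

// Construct the polymorphic object in device code, so its vtable
// holds device function pointers
__global__ void createAndUse() {
    Shape* s = new Square(3.0f);
    printf("area = %f\n", s->area());
    delete s;
}

int main() {
    createAndUse<<<1,1>>>();
    cudaDeviceSynchronize();
    return 0;
}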
I wonder is there some way for me to setup the "virtual pointer" manually
There might be. In the interest of sharing knowledge, I will outline a method. However, I don't know C++ well enough to say whether this is acceptable/legal. The only thing I can say is in my very limited testing, it appears to work. But I would assume it is not legal and so I do not recommend you use this method for anything other than experimentation. Even if we don't resolve whether or not it is legal, there is already a stated CUDA limitation (as indicated above) that you should not attempt to pass objects with virtual functions between host and device. So I offer it merely as an observation, which may be interesting for experimentation or research. I don't suggest it for production code.
The basic idea is outlined in this thread. It is predicated on the idea that an ordinary object-copy does not seem to copy the virtual function pointer table, which makes sense to me, but that the object as a whole does contain the table. Therefore if we use a method like this:
template<typename T>
__device__ void fixVirtualPointers(T *other) {
    T temp = T(*other); // object-copy moves the "guts" of the object w/o changing vtable
    memcpy(other, &temp, sizeof(T)); // pointer copy seems to move vtable
}
it seems to be possible to take a given object, create a new "dummy" object of that type, and then "fix up" the vtable by doing a pointer-based copy of the object (considering the entire object size) rather than a "typical" object-copy. Use this at your own risk. This blog may also be interesting reading, although I can't vouch for the correctness of any statements there.
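To make the intended usage concrete, a hypothetical sequence might look like this. Derived and someVirtualFunction are made-up names; the virtual functions are assumed to be marked __host__ __device__, and the copy constructor of Derived must be usable in device code for fixVirtualPointers to compile:
__global__ void fixup(Derived* d) { fixVirtualPointers(d); }
__global__ void use(Derived* d)   { d->someVirtualFunction(); }

// host side
Derived h_obj;
Derived* d_obj;
cudaMalloc(&d_obj, sizeof(Derived));
cudaMemcpy(d_obj, &h_obj, sizeof(Derived), cudaMemcpyHostToDevice); // vtable now holds host pointers
fixup<<<1,1>>>(d_obj);   // rebuild the vtable with device pointers
use<<<1,1>>>(d_obj);     // the virtual call now resolves to device code
cudaDeviceSynchronize();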
Beyond this, there are a variety of other suggestions here on the cuda tag, you may wish to review them.
I would like to provide a different way to fix the vtable which does not rely on copying the vtable between objects. The idea is to use placement new on the device to let the compiler generate the appropriate vtable. However, this approach also violates the restrictions stated in the programming guide.
#include <cstdio>

struct A{
    __host__ __device__
    virtual void foo(){
        printf("A\n");
    }
};

struct B : public A{
    B(int i = 13) : data(i){}

    __host__ __device__
    virtual void foo() override{
        printf("B %d\n", data);
    }

    int data;
};

template<class T>
__global__
void fixKernel(T* ptr){
    T tmp(*ptr);
    new (ptr) T(tmp);
}

__global__
void useKernel(A* ptr){
    ptr->foo();
}

int main(){
    A a;
    a.foo();

    B b(7);
    b.foo();

    A* ab = new B();
    ab->foo();

    A* d_a;
    cudaMalloc(&d_a, sizeof(A));
    cudaMemcpy(d_a, &a, sizeof(A), cudaMemcpyHostToDevice);

    B* d_b;
    cudaMalloc(&d_b, sizeof(B));
    cudaMemcpy(d_b, &b, sizeof(B), cudaMemcpyHostToDevice);

    fixKernel<<<1,1>>>(d_a);
    useKernel<<<1,1>>>(d_a);

    fixKernel<<<1,1>>>(d_b);
    useKernel<<<1,1>>>(d_b);

    cudaDeviceSynchronize();

    cudaFree(d_b);
    cudaFree(d_a);
    delete ab;
}
I'm doing a thrust transform_reduce and need to access a thrust::device_vector from within the functor. I am not iterating on the device_vector. It allows me to declare the functor, passing in the device_vector reference, but won't let me dereference it, either with begin() or operator[].
1>C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\include\thrust/detail/function.h(187): warning : calling a __host__ function("thrust::detail::vector_base > ::operator []") from a __host__ __device__ function("thrust::detail::host_device_function ::operator () ") is not allowed
I assume I'll be able to pass in the base pointer and do the pointer math myself, but is there a reason this isn't supported?
Just expanding on what @JaredHoberock has already indicated. I think he will not mind.
A functor usable by thrust must (for the most part) conform to the requirements imposed on any CUDA device code.
Both thrust::host_vector and thrust::device_vector are host code containers used to manipulate host data and device data respectively. A reference to a host code container cannot be used successfully in device code. This means that even if you passed a reference to the container successfully, you could not use it in device code (you could not call .push_back(), for example).
For direct manipulation in device code (such as in functors or kernels) you must extract raw device pointers from thrust and use those directly, with your own pointer arithmetic. Advanced functions (such as .push_back()) will not be available.
There are a variety of ways to extract the raw device pointer corresponding to thrust data, and the following example code demonstrates two possibilities:
$ cat t651.cu
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
__global__ void printkernel(float *data){
    printf("data = %f\n", *data);
}

int main(){
    thrust::device_vector<float> mydata(5);
    thrust::sequence(mydata.begin(), mydata.end());

    printkernel<<<1,1>>>(mydata.data().get());
    printkernel<<<1,1>>>(thrust::raw_pointer_cast(&mydata[2]));
    cudaDeviceSynchronize();
    return 0;
}
$ nvcc -o t651 t651.cu
$ ./t651
data = 0.000000
data = 2.000000
$
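Since the original question was about a functor rather than a kernel, here is a sketch of the same idea with thrust::transform_reduce. The functor name and the data are made up; the point is capturing a raw device pointer at construction time and indexing it directly inside operator():
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <thrust/sequence.h>

// Hypothetical functor: holds a raw pointer into a device_vector's storage
struct scaled_by_table
{
    const float* table;
    scaled_by_table(const float* t) : table(t) {}

    __host__ __device__
    float operator()(int i) const { return 2.0f * table[i]; }
};

int main()
{
    thrust::device_vector<float> table(5);
    thrust::sequence(table.begin(), table.end());       // 0,1,2,3,4

    thrust::device_vector<int> indices(5);
    thrust::sequence(indices.begin(), indices.end());   // 0,1,2,3,4

    float sum = thrust::transform_reduce(indices.begin(), indices.end(),
                                         scaled_by_table(thrust::raw_pointer_cast(table.data())),
                                         0.0f, thrust::plus<float>());
    // sum should be 2*(0+1+2+3+4) = 20
    return 0;
}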
What are the advantages for using Thrust device_malloc instead of the normal cudaMalloc and what does device_new do?
For device_malloc it seems the only reason to use it is that it's just a bit cleaner.
The device_new documentation says:
"device_new implements the placement new operator for types resident
in device memory. device_new calls T's null constructor on an array of
objects in device memory. No memory is allocated by this function."
Which I don't understand...
device_malloc returns the proper type of object if you plan on using Thrust for other things. There is normally no reason to use cudaMalloc if you are using Thrust. Encapsulating CUDA calls makes it easier and usually cleaner. The same thing goes for C++ and STL containers versus C-style arrays and malloc.
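For illustration, a minimal sketch of the difference (the sizes and fill value here are arbitrary):
#include <thrust/device_malloc.h>
#include <thrust/device_free.h>
#include <thrust/device_ptr.h>
#include <thrust/fill.h>

int main()
{
    // device_malloc returns a typed thrust::device_ptr<int>, which thrust
    // algorithms accept directly; no casts or raw pointers needed
    thrust::device_ptr<int> d_array = thrust::device_malloc<int>(100);
    thrust::fill(d_array, d_array + 100, 42);
    thrust::device_free(d_array);

    // the cudaMalloc equivalent needs a void** cast and a pointer wrap
    int* d_raw;
    cudaMalloc((void**)&d_raw, 100 * sizeof(int));
    thrust::fill(thrust::device_pointer_cast(d_raw),
                 thrust::device_pointer_cast(d_raw) + 100, 42);
    cudaFree(d_raw);
    return 0;
}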
For device_new, you should read the following line of the documentation:
template<typename T>
device_ptr<T> thrust::device_new (device_ptr< void > p, const size_t n = 1)
p: A device_ptr to a region of device memory into which to construct
one or many Ts.
Basically, this function can be used if memory has already been allocated. Only the default constructor will be called, and this will return a device_ptr cast to T's type.
On the other hand, the following method allocates memory and returns a device_ptr<T>:
template<typename T >
device_ptr<T> thrust::device_new (const size_t n = 1)
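As a short sketch of what the two overloads from the documentation look like in use (Foo is a hypothetical default-constructible type):
#include <thrust/device_malloc.h>
#include <thrust/device_new.h>
#include <thrust/device_free.h>

struct Foo { int x; __host__ __device__ Foo() : x(0) {} };  // hypothetical type

int main()
{
    // Form 1: construct into memory that was already allocated
    thrust::device_ptr<void> raw = thrust::device_malloc(sizeof(Foo));
    thrust::device_ptr<Foo> p1 = thrust::device_new<Foo>(raw, 1);

    // Form 2: allocate and default-construct in one call
    thrust::device_ptr<Foo> p2 = thrust::device_new<Foo>(1);

    thrust::device_free(p1);
    thrust::device_free(p2);
    return 0;
}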
So I think I found out one good use for device_new
It's basically a better way of initialising an object and copying it to the device, while holding a pointer to it on host.
so instead of doing:
Particle *dev_p;
cudaMalloc((void**)&(dev_p), sizeof(Particle));
cudaMemcpy(dev_p, &p, sizeof(Particle), cudaMemcpyHostToDevice);
test2<<<1,1>>>(dev_p);
I can just do:
thrust::device_ptr<Particle> p = thrust::device_new<Particle>(1);
test2<<<1,1>>>(thrust::raw_pointer_cast(p));
So I asked a question before about how to allocate an object on the device directly instead of the "normal":
Allocate on host
Copy to device
Copy dynamically allocated fields to device one by one
The main reason I want it to be allocated directly on the device is that I don't want to copy each dynamically allocated field inside one by one manually.
Anyway, so I think I have actually found a way to do this, and I would like to see some input from more experienced CUDA programmers (like Robert Crovella).
Let's see the code first:
#include <cstdio>

class Particle
{
public:
    int *data;

    __device__ Particle()
    {
        data = new int[10];
        for (int i=0; i<10; i++)
        {
            data[i] = i*2;
        }
    }
};

__global__ void test(Particle **result)
{
    Particle *p = new Particle();
    result[0] = p; // store memory location
}

__global__ void test2(Particle *p)
{
    for (int i=0; i<10; i++)
        printf("%d\n", p->data[i]);
}

int main() {
    // initialise and allocate an object on device
    Particle **d_p_addr;
    cudaMalloc((void**)&d_p_addr, sizeof(Particle*));
    test<<<1,1>>>(d_p_addr);

    // copy pointer to host memory
    Particle **p_addr = new Particle*[1];
    cudaMemcpy(p_addr, d_p_addr, sizeof(Particle*), cudaMemcpyDeviceToHost);

    // test:
    test2<<<1,1>>>(p_addr[0]);
    cudaDeviceSynchronize();

    printf("Done!\n");
}
As you can see, what I do is:
Call a kernel that initialises an object on the device and stores its pointer in an output parameter
Copy the pointer to the allocated object from device memory to host memory
Now you can pass that pointer to another kernel just fine !
This code actually works, but I'm not sure if there are drawbacks.
Cheers
EDIT: as pointed out by Robert, there was no point of creating a pointer on host first, so I removed that part from the code.
Yes, you can do that.
You are allocating an object on the device, and passing a pointer to it from one kernel to the next. Since a characteristic of device malloc/new is that allocations persist for the lifetime of the context (not just the kernel), the allocations do not disappear at the end of the kernel. This is basically standard C++ behavior, but I thought it might be worth repeating. The pointer(s) that you are passing from one kernel to the next are therefore valid in any subsequent device code in the context of your program.
There is a wrinkle you might want to be aware of, however. Pointers returned by dynamic allocations done on the device (such as via new or malloc in device code) are not usable for transferring data from device to host, at least in the present incarnation of CUDA (CUDA 5.0 and earlier). The reasons for this are somewhat arcane (translation: I can't adequately explain it) but it's instructive to think about the fact that dynamic allocations come out of the device heap, a region that is logically separate from the region of global memory that runtime API functions like cudaMalloc and cudaMemcpy use. An oblique indication of this is given here:
Memory reserved for the device heap is in addition to memory allocated through host-side CUDA API calls such as cudaMalloc().
If you want to prove this wrinkle to yourself, try adding the following seemingly innocuous code after your second kernel call:
Particle *q;
q = (Particle *)malloc(sizeof(Particle));
cudaMemcpy(q, p_addr[0], sizeof(Particle), cudaMemcpyDeviceToHost);
If you then check the API error value returned from that cudaMemcpy operation, you will observe the error.
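For example, a check along these lines (reusing the names from the snippet above) should report the failure:
cudaError_t err = cudaMemcpy(q, p_addr[0], sizeof(Particle), cudaMemcpyDeviceToHost);
if (err != cudaSuccess)
    printf("cudaMemcpy failed: %s\n", cudaGetErrorString(err));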
As an unrelated comment, your use of the pointer *p is a little freaky, in my book, and the compiler warning given about it is an indication of the weirdness. It's not technically illegal, since you're not actually doing anything meaningful with that pointer (you immediately replace it in your kernel 1) but nevertheless it's weird because you're passing a pointer to a kernel that you haven't properly cudaMalloc'ed. In the context of what you're demonstrating, it's completely unnecessary, and your first parameter to kernel 1 could be eliminated and replaced with a local variable, eliminating the weirdness and compiler warning.
Is there any way to declare an array such as:
int arraySize = 10;
int array[arraySize];
inside a CUDA kernel/function? I read in another post that I could declare the size of the shared memory in the kernel call and then I would be able to do:
int array[];
But I cannot do this. I get a compile error: "incomplete type is not allowed". On a side note, I've also read that printf() can be called from within a thread and this also throws an error: "Cannot call host function from inside device/global function".
Is there anything I can do to make a variable sized array or equivalent inside CUDA? I am at compute capability 1.1, does this have anything to do with it? Can I get around the variable size array declarations from within a thread by defining a typedef struct which has a size variable I can set? Solutions for compute capabilities besides 1.1 are welcome. This is for a class team project and if there is at least some way to do it I can at least present that information.
About the printf: the problem is that it only works on compute capability 2.x and higher. There is an alternative, cuPrintf, that you might try.
For the allocation of variable size arrays in CUDA you do it like this:
Inside the kernel you declare the array as extern __shared__ int array[]; (the array must be given a name)
On the kernel call you pass as the third launch parameter the shared memory size in bytes like mykernel<<<gridsize, blocksize, sharedmemsize>>>();
This is explained in the CUDA C programming guide in section B.2.3 about the __shared__ qualifier.
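Putting those two pieces together, a minimal sketch (assuming one shared int per thread in the block) looks like this:
// dynamically sized shared memory; the size is supplied at launch time
__global__ void mykernel(int* out)
{
    extern __shared__ int array[];          // size set by the launch configuration
    array[threadIdx.x] = threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = array[blockDim.x - 1 - threadIdx.x];
}

// host side: the third launch parameter is the shared memory size in bytes
// mykernel<<<gridsize, blocksize, blocksize * sizeof(int)>>>(d_out);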
If your arrays can be large, one solution would be to have one kernel that computes the required array sizes, stores them in an array, then after that invocation, the host allocates the necessary arrays and passes an array of pointers to the threads, and then you run your computation as a second kernel.
Whether this helps depends on what you have to do, because it would be arrays allocated in global memory. If the total size (per block) of your arrays is less than the size of the available shared memory, you could have a sufficiently large shared memory array and let your threads negotiate splitting it amongst themselves.
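A rough sketch of that two-pass approach, assuming one array per thread and a made-up size computation:
__global__ void computeSizes(int* sizes)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    sizes[t] = (t % 4) + 1;                 // stand-in for a real size computation
}

__global__ void compute(int** arrays, const int* sizes)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < sizes[t]; i++)
        arrays[t][i] = t;                   // each thread works on its own array
}

int main()
{
    const int nthreads = 8;

    // pass 1: compute the required sizes on the device
    int* d_sizes;
    cudaMalloc(&d_sizes, nthreads * sizeof(int));
    computeSizes<<<1, nthreads>>>(d_sizes);

    int sizes[nthreads];
    cudaMemcpy(sizes, d_sizes, nthreads * sizeof(int), cudaMemcpyDeviceToHost);

    // host allocates one global-memory array per thread and builds a pointer table
    int* h_ptrs[nthreads];
    for (int t = 0; t < nthreads; t++)
        cudaMalloc(&h_ptrs[t], sizes[t] * sizeof(int));
    int** d_ptrs;
    cudaMalloc(&d_ptrs, nthreads * sizeof(int*));
    cudaMemcpy(d_ptrs, h_ptrs, nthreads * sizeof(int*), cudaMemcpyHostToDevice);

    // pass 2: run the actual computation with the per-thread arrays
    compute<<<1, nthreads>>>(d_ptrs, d_sizes);
    cudaDeviceSynchronize();
    return 0;
}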