Possible Duplicate:
Copying a struct containing pointers to CUDA device
I have a structure of device pointers, pointing to arrays allocated on the device, like this:
struct mystruct {
    int*    dev1;
    double* dev2;
    // ... many more device pointers ...
};
There are a large number of arrays in this structure. I started writing a CUDA kernel in which
I passed a pointer to mystruct and then dereferenced it within the
CUDA kernel code, like this: mystruct->dev1[i].
But I realized after writing a few lines that this will not work, since by CUDA first principles
you cannot dereference a host pointer (in this case, the pointer to mystruct) within a CUDA kernel.
But this is rather inconvenient, since I would have to pass a large number of arguments
to my kernels. Is there any way to avoid this? I would like to keep the argument lists
of my kernel calls as short as possible.
As I explain in this answer, you can pass your struct by value to the kernel, so you don't have to worry about dereferencing a host pointer:
__global__ void kernel(mystruct in)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    in.dev1[idx] *= 2;
    in.dev2[idx] += 3.14159;
}
There is some overhead to passing the struct by value that you should be aware of. However, if your struct is not too large, it shouldn't matter.
If you pass the same struct to a lot of kernels, or repeatedly, you may consider copying the struct itself to global or constant memory instead as suggested by aland, or use mapped host memory as suggested by Mark Ebersole. But passing the struct by value is a much simpler way to get started.
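For completeness, a minimal host-side sketch of the pass-by-value approach might look like this (the member names match the struct above; the element count N and the launch configuration are assumptions):

mystruct s;                                      // ordinary host-side struct
cudaMalloc((void**)&s.dev1, N * sizeof(int));    // members point to device arrays
cudaMalloc((void**)&s.dev2, N * sizeof(double));
// ... allocate the remaining members the same way ...
kernel<<<N / 256, 256>>>(s);                     // the struct itself is copied by value at launch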
(Note: please search StackOverflow before duplicating questions...)
You can copy your mystruct structure to global memory and pass its device address to the kernel.
From a performance viewpoint, however, it would be better to store mystruct in constant memory, since (I guess) there are a lot of random reads from it by many threads.
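A rough sketch of the constant-memory variant (using the member names from the question; the kernel name and the host-side struct s are just illustrative):

__constant__ mystruct c_struct;                  // one copy, visible to all kernels

__global__ void kernel_const()
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    c_struct.dev1[idx] *= 2;                     // read the device pointers from constant memory
}

// host side, after filling a host struct s with device pointers:
// cudaMemcpyToSymbol(c_struct, &s, sizeof(mystruct));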
You could also use page-locked (pinned) host memory and create the structure within that region if your setup supports it. Please see 3.2.4 of the CUDA programming guide.
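If you go the mapped-memory route, the outline is roughly as follows (a sketch only; kernel_ptr is a hypothetical kernel taking a mystruct*, and older CUDA versions may also require cudaSetDeviceFlags(cudaDeviceMapHost) before any other CUDA call):

mystruct *h_s;                                            // struct lives in pinned host memory
cudaHostAlloc((void**)&h_s, sizeof(mystruct), cudaHostAllocMapped);
cudaMalloc((void**)&h_s->dev1, N * sizeof(int));          // members still point to device arrays
mystruct *d_s;
cudaHostGetDevicePointer((void**)&d_s, h_s, 0);           // device-visible alias of h_s
kernel_ptr<<<N / 256, 256>>>(d_s);                        // this pointer can be dereferenced in the kernel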
In the CUDA examples I have read, I don't find any direct use of 2D array notation ([][]) in kernel code when the array is in global memory, unlike when it is in shared memory, e.g. in matrix multiplication. Is there any performance-related reason behind this?
Also, I read in an old thread that the following code is incorrect:
int **d_array;
cudaMalloc( (void**)&d_array , 5 * sizeof(int*) );
for(int i = 0 ; i < 5 ; i++)
{
    cudaMalloc((void **)&d_array[i], 10 * sizeof(int));
}
According to the author, "once the main thread assigns memory on the device the main thread loses access to it, that is, it can only be accessed within kernels. So, When you try call cudaMalloc on the 2nd dimension of the array it throws an "Access violation writing location" exception."
I don't understand what the author really means; actually, I find the above code correct.
Thank you for your help
SS
Is there any performance-related reason behind this?
Yes, a doubly-subscripted array normally requires an extra pointer lookup, i.e. an extra memory read, before the data referenced can be accessed. By using "simulated" 2D access:
int val = d[i*columns+j];
instead of:
int val = d[i][j];
then only a single memory read access is required. The proper index is computed directly, rather than requiring the read of a row pointer first. GPUs generally have plenty of compute capability relative to memory bandwidth, so trading a memory read for a little index arithmetic is usually a win.
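As a small illustration (a sketch; the kernel name and the bounds check are mine), the simulated 2D access inside a kernel looks like this:

__global__ void scale2d(int *d, int rows, int columns)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;   // row
    int j = blockIdx.x * blockDim.x + threadIdx.x;   // column
    if (i < rows && j < columns)
        d[i * columns + j] *= 2;                     // single read/write, no row-pointer chase
}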
I don't understand what the author really means; actually, I find the above code correct
The code is in fact incorrect.
This operation:
cudaMalloc( (void**)&d_array , 5 * sizeof(int*) );
creates a single contiguous allocation on the device, of length equal to 5 pointers storage, and takes the starting address of that allocation, and stores it in the host memory location associated with d_array. That is what cudaMalloc does: it creates a device allocation of the requested length, and stores the starting device address of that allocation in the provided host memory variable.
So let's deconstruct what is being asked for here:
cudaMalloc((void **)&d_array[i],10 * sizeof(int));
This says, create a device allocation of length 10*sizeof(int) and store the starting address of it in the location d_array[i]. But the location associated with d_array[i] is on the device, not the host, and requires dereferencing of the d_array pointer to actually access it, to store something there.
cudaMalloc does not do this. You cannot ask for the starting address of the device allocation to be stored in device memory. You can only ask for the starting address of the device allocation to be stored in host memory.
&d_array
is a pointer to host memory.
&d_array[i]
is a pointer to device memory.
The canonical 2D array worked example is now referenced in the cuda tag info link.
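For reference, one correct pattern (a sketch of the usual fix, not the only option) is to build the array of row pointers in host memory first, then copy that pointer array to the device in a single cudaMemcpy:

int **d_array;
int  *h_temp[5];                                          // host-side copy of the row pointers
cudaMalloc((void**)&d_array, 5 * sizeof(int*));           // device array of pointers
for (int i = 0; i < 5; i++)
    cudaMalloc((void**)&h_temp[i], 10 * sizeof(int));     // each row; address lands in host memory
cudaMemcpy(d_array, h_temp, 5 * sizeof(int*), cudaMemcpyHostToDevice);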
What are the advantages for using Thrust device_malloc instead of the normal cudaMalloc and what does device_new do?
For device_malloc it seems the only reason to use it is that it's just a bit cleaner.
The device_new documentation says:
"device_new implements the placement new operator for types resident
in device memory. device_new calls T's null constructor on a array of
objects in device memory. No memory is allocated by this function."
Which I don't understand...
device_malloc returns the proper type of object if you plan on using Thrust for other things. There is normally no reason to use cudaMalloc if you are using Thrust. Encapsulating CUDA calls makes it easier and usually cleaner. The same thing goes for C++ and STL containers versus C-style arrays and malloc.
For device_new, you should read the following line of the documentation:
template<typename T>
device_ptr<T> thrust::device_new (device_ptr< void > p, const size_t n = 1)
p: A device_ptr to a region of device memory into which to construct
one or many Ts.
Basically, this function can be used if the memory has already been allocated. Only the default constructor will be called, and it returns a device_ptr cast to T's type.
On the other hand, the following method allocates memory and returns a device_ptr<T>:
template<typename T >
device_ptr<T> thrust::device_new (const size_t n = 1)
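A small sketch contrasting the two overloads (with int as a stand-in element type):

#include <thrust/device_ptr.h>
#include <thrust/device_malloc.h>
#include <thrust/device_new.h>
#include <thrust/device_free.h>

int main()
{
    // overload 2: allocate and default-construct 10 ints in one call
    thrust::device_ptr<int> a = thrust::device_new<int>(10);

    // overload 1: construct into memory that was already allocated
    thrust::device_ptr<void> raw = thrust::device_malloc(10 * sizeof(int));
    thrust::device_ptr<int>  b   = thrust::device_new<int>(raw, 10);  // b aliases raw, typed as int

    // int is trivially destructible, so device_free is enough here
    thrust::device_free(a);
    thrust::device_free(raw);
    return 0;
}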
So I think I have found one good use for device_new.
It's basically a better way of initialising an object and copying it to the device, while holding a pointer to it on the host.
So instead of doing:
Particle *dev_p;
cudaMalloc((void**)&(dev_p), sizeof(Particle));
cudaMemcpy(dev_p, &p, sizeof(Particle), cudaMemcpyHostToDevice);
test2<<<1,1>>>(dev_p);
I can just do:
thrust::device_ptr<Particle> p = thrust::device_new<Particle>(1);
test2<<<1,1>>>(thrust::raw_pointer_cast(p));
So I asked a question before about how to allocate an object on the device directly instead of the "normal":
Allocate on host
Copy to device
Copy dynamically allocated fields to device one by one
The main reason I want it to be allocated directly on the device is that I don't want to copy each dynamically allocated field inside one by one manually.
Anyway, so I think I have actually found a way to do this, and I would like to see some input from more experienced CUDA programmers (like Robert Crovella).
Let's see the code first:
class Particle
{
public:
    int *data;

    __device__ Particle()
    {
        data = new int[10];
        for (int i = 0; i < 10; i++)
        {
            data[i] = i * 2;
        }
    }
};

__global__ void test(Particle **result)
{
    Particle *p = new Particle();
    result[0] = p; // store memory location
}

__global__ void test2(Particle *p)
{
    for (int i = 0; i < 10; i++)
        printf("%d\n", p->data[i]);
}

int main() {
    // initialise and allocate an object on device
    Particle **d_p_addr;
    cudaMalloc((void**)&d_p_addr, sizeof(Particle*));
    test<<<1,1>>>(d_p_addr);

    // copy pointer to host memory
    Particle **p_addr = new Particle*[1];
    cudaMemcpy(p_addr, d_p_addr, sizeof(Particle*), cudaMemcpyDeviceToHost);

    // test:
    test2<<<1,1>>>(p_addr[0]);
    cudaDeviceSynchronize();
    printf("Done!\n");
}
As you can see, what I do is:
Call a kernel that initialises an object on the device and stores its pointer in an output parameter
Copy the pointer to the allocated object from device memory to host memory
Now you can pass that pointer to another kernel just fine!
This code actually works, but I'm not sure if there are drawbacks.
Cheers
EDIT: as pointed out by Robert, there was no point in creating a pointer on the host first, so I removed that part from the code.
Yes, you can do that.
You are allocating an object on the device, and passing a pointer to it from one kernel to the next. Since a characteristic of device malloc/new is that allocations persist for the lifetime of the context (not just the kernel), the allocations do not disappear at the end of the kernel. This is basically standard C++ behavior, but I thought it might be worth repeating. The pointer(s) that you are passing from one kernel to the next are therefore valid in any subsequent device code in the context of your program.
There is a wrinkle you might want to be aware of, however. Pointers returned by dynamic allocations done on the device (such as via new or malloc in device code) are not usable for transferring data from device to host, at least in the present incarnation of cuda (cuda 5.0 and earlier). The reasons for this are somewhat arcane (translation: I can't adequately explain it) but it's instructive to think about the fact that dynamic allocations come out of the device heap, a region that is logically separate from the region of global memory that runtime API functions like cudaMalloc and cudaMemcpy use. An oblique indication of this is given here:
Memory reserved for the device heap is in addition to memory allocated through host-side CUDA API calls such as cudaMalloc().
If you want to prove this wrinkle to yourself, try adding the following seemingly innocuous code after your second kernel call:
Particle *q;
q = (Particle *)malloc(sizeof(Particle));
cudaMemcpy(q, p_addr[0], sizeof(Particle), cudaMemcpyDeviceToHost);
If you then check the API error value returned from that cudaMemcpy operation, you will observe the error.
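For example (a small sketch of the check itself):

cudaError_t err = cudaMemcpy(q, p_addr[0], sizeof(Particle), cudaMemcpyDeviceToHost);
printf("cudaMemcpy: %s\n", cudaGetErrorString(err));   // prints an error string, not "no error"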
As an unrelated comment, your use of the pointer *p is a little freaky, in my book, and the compiler warning given about it is an indication of the weirdness. It's not technically illegal, since you're not actually doing anything meaningful with that pointer (you immediately replace it in your kernel 1), but nevertheless it's weird because you're passing a pointer to a kernel that you haven't properly cudaMalloc'ed. In the context of what you're demonstrating, it's completely unnecessary, and your first parameter to kernel 1 could be eliminated and replaced with a local variable, eliminating the weirdness and the compiler warning.
I need to perform a parallel reduction to find the min or max of an array on a CUDA device. I found a good library for this, called Thrust. It seems that you can only perform a parallel reduction on arrays in host memory. My data is in device memory. Is it possible to perform a reduction on data in device memory?
I can't figure out how to do this. Here is the documentation for Thrust: http://code.google.com/p/thrust/wiki/QuickStartGuide#Reductions. Thank you all.
You can do reductions in Thrust on arrays which are already in device memory. All you need to do is wrap your device pointers in thrust::device_ptr containers and call one of the reduction procedures, just as shown in the wiki you have linked to:
// assume this is a valid device allocation holding N words of data
int * dmem;
// Wrap raw device pointer
thrust::device_ptr<int> dptr(dmem);
// use max_element for reduction
thrust::device_ptr<int> dresptr = thrust::max_element(dptr, dptr+N);
// retrieve result from device (if required)
int max_value = dresptr[0];
Note that the return value is also a device_ptr, so you can use it directly in other kernels using thrust::raw_pointer_cast:
int * dres = thrust::raw_pointer_cast(dresptr);
If Thrust or any other library does not provide such a service, you can still write that kernel yourself.
Mark Harris has a great tutorial about parallel reduction and its optimisation in CUDA.
Following his slides, it is not that hard to implement and modify it for your needs.
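To give a flavour of it, here is a minimal shared-memory max reduction in that spirit (a sketch, not a tuned implementation; it produces one partial maximum per block, which you then reduce again or finish on the host):

#include <limits.h>

__global__ void max_reduce(const int *in, int *out, int n)
{
    extern __shared__ int sdata[];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = (i < n) ? in[i] : INT_MIN;              // pad out-of-range threads
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {       // tree reduction in shared memory
        if (tid < s)
            sdata[tid] = max(sdata[tid], sdata[tid + s]);
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = sdata[0];                      // one partial result per block
}

// launch: max_reduce<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_out, n);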
Is there any way to declare an array such as:
int arraySize = 10;
int array[arraySize];
inside a CUDA kernel/function? I read in another post that I could declare the size of the shared memory in the kernel call and then I would be able to do:
int array[];
But I cannot do this. I get a compile error: "incomplete type is not allowed". On a side note, I've also read that printf() can be called from within a thread and this also throws an error: "Cannot call host function from inside device/global function".
Is there anything I can do to make a variable sized array or equivalent inside CUDA? I am at compute capability 1.1, does this have anything to do with it? Can I get around the variable size array declarations from within a thread by defining a typedef struct which has a size variable I can set? Solutions for compute capabilities besides 1.1 are welcome. This is for a class team project and if there is at least some way to do it I can at least present that information.
As for printf, the problem is that it only works on compute capability 2.x. There is an alternative, cuPrintf, that you might try.
For the allocation of variable-size arrays in CUDA, you do it like this:
Inside the kernel you write extern __shared__ int array[]; (note that the array must be given a name)
On the kernel call you pass as the third launch parameter the shared memory size in bytes like mykernel<<<gridsize, blocksize, sharedmemsize>>>();
This is explained in the CUDA C programming guide in section B.2.3 about the __shared__ qualifier.
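Putting the two steps together, a minimal sketch (the kernel name and the int element type are arbitrary) looks like this:

__global__ void mykernel(int n)
{
    extern __shared__ int array[];       // size is set by the launch configuration
    int tid = threadIdx.x;
    if (tid < n)
        array[tid] = tid;
    __syncthreads();
    // ... use array ...
}

// host side: the third launch parameter is the shared memory size in bytes
// mykernel<<<gridsize, blocksize, n * sizeof(int)>>>(n);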
If your arrays can be large, one solution would be to have one kernel that computes the required array sizes and stores them in an array; after that invocation, the host allocates the necessary arrays and passes an array of pointers to the threads, and then you run your computation as a second kernel.
Whether this helps depends on what you have to do, because these would be arrays allocated in global memory. If the total size (per block) of your arrays is less than the size of the available shared memory, you could have a sufficiently large shared memory array and let your threads negotiate splitting it amongst themselves.