Parallel reduction on CUDA with array in device - cuda

I need to perform a parallel reduction to find the min or max of an array on a CUDA device. I found a good library for this, called Thrust. It seems that you can only perform a parallel reduction on arrays in host memory. My data is in device memory. Is it possible to perform a reduction on data in device memory?
I can't figure out how to do this. Here is the documentation for Thrust: http://code.google.com/p/thrust/wiki/QuickStartGuide#Reductions. Thank you all.

You can do reductions in Thrust on arrays which are already in device memory. All you need to do is wrap your raw device pointers in thrust::device_ptr objects and call one of the reduction algorithms, just as shown in the wiki you have linked to:
// assume this is a valid device allocation holding N words of data
int * dmem;
// Wrap raw device pointer
thrust::device_ptr<int> dptr(dmem);
// use max_element for reduction
thrust::device_ptr<int> dresptr = thrust::max_element(dptr, dptr+N);
// retrieve result from device (if required)
int max_value = dresptr[0];
Note that the return value is also a device_ptr, so you can pass it to your own kernels after converting it back to a raw pointer with thrust::raw_pointer_cast:
int * dres = thrust::raw_pointer_cast(dresptr);
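For reference, here is a minimal, self-contained sketch of the same idea. The sample data and variable names are made up for illustration, and error checking is omitted. It wraps a raw cudaMalloc allocation, finds the maximum with thrust::max_element (which also gives the position), and computes the minimum as a plain value with thrust::reduce.

```cuda
#include <thrust/device_ptr.h>
#include <thrust/extrema.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <cuda_runtime.h>
#include <climits>
#include <cstdio>

int main()
{
    // Illustrative data; in the question the array already lives on the device.
    const int N = 8;
    const int h_data[N] = {3, 9, -2, 7, 5, 11, 0, 4};

    int *dmem = nullptr;
    cudaMalloc(&dmem, N * sizeof(int));
    cudaMemcpy(dmem, h_data, N * sizeof(int), cudaMemcpyHostToDevice);

    // Wrap the raw device pointer so Thrust dispatches to the GPU.
    thrust::device_ptr<int> dptr = thrust::device_pointer_cast(dmem);

    // max_element returns a device_ptr to the largest element.
    thrust::device_ptr<int> max_it = thrust::max_element(dptr, dptr + N);
    int max_value = *max_it;                          // copies one int to the host
    int max_index = static_cast<int>(max_it - dptr);  // position in the array

    // reduce with thrust::minimum returns the value directly.
    int min_value = thrust::reduce(dptr, dptr + N, INT_MAX, thrust::minimum<int>());

    std::printf("max = %d at index %d, min = %d\n", max_value, max_index, min_value);

    cudaFree(dmem);
    return 0;
}
```

If only the value is needed, thrust::reduce is enough; max_element and min_element are the better fit when the position matters too.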

If Thrust or another library does not provide what you need, you can still write the kernel yourself.
Mark Harris has a great tutorial about parallel reduction and its optimisations in CUDA.
Following his slides, it is not that hard to implement a reduction and adapt it to your needs.

Related

CUDA: cublas: exception raised at dot product [duplicate]

I need to find the index of the maximum element in an array of floats. I am using the function "cublasIsamax", but this returns the index to the CPU, and this is slowing down the running time of the application.
Is there a way to compute this index efficiently and store it in the GPU?
Thanks!
Since the CUBLAS V2 API was introduced (with CUDA 4.0, IIRC), routines which return a scalar or an index can store the result directly in a variable in device memory, rather than in a host variable (which entails a device-to-host transfer and might leave the result in the wrong memory space).
To use this, call cublasSetPointerMode with the CUBLAS_POINTER_MODE_DEVICE mode, which tells the CUBLAS context to treat pointers to scalar arguments as device pointers. This then implies that in a call like
cublasStatus_t cublasIsamax(cublasHandle_t handle, int n,
const float *x, int incx, int *result)
that result must be a device pointer.
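A minimal sketch of how this might look in a complete program follows. The input values and variable names are illustrative, and status checks are omitted. The final copy back to the host is there only to print the result; in a real pipeline d_result would simply be handed to the next kernel. Keep in mind that cublasIsamax returns the 1-based index of the element with the largest absolute value, which is not the same as the largest signed value.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    // Illustrative input; -7.5f has the largest magnitude.
    const int n = 5;
    const float h_x[n] = {1.0f, -7.5f, 3.0f, 2.0f, 6.0f};

    float *d_x = nullptr;
    int *d_result = nullptr;   // the index is written here, on the device
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_result, sizeof(int));
    cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Scalar results are now expected to be device pointers.
    cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);
    cublasIsamax(handle, n, d_x, 1, d_result);

    // Only for demonstration: fetch the index to print it.
    int h_result = 0;
    cudaMemcpy(&h_result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
    std::printf("1-based index of max |x|: %d\n", h_result);

    cublasDestroy(handle);
    cudaFree(d_result);
    cudaFree(d_x);
    return 0;
}
```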
If you want to use CUBLAS and you have a GPU with compute capability 3.5 (K20, Titan), then you can use CUBLAS with dynamic parallelism. You can then call CUBLAS from within a kernel on the GPU, and no data will be returned to the CPU.
If you have no device with cc 3.5, you will probably have to implement a find-max function yourself or look for an additional library.

using thrust::sort inside a thread

I would like to know if thrust::sort() can be used inside a thread
__global__
void mykernel(float* array, int arrayLength)
{
int threadID = blockIdx.x * blockDim.x + threadIdx.x;
// array is a vector in device global memory
// is it possible to use inside the thread?
thrust::sort(array, array+arrayLength);
// do something else with the array
}
If yes, does the sort launch other kernels to parallelize the sort?
Yes, thrust::sort can be combined with the thrust::seq execution policy to sort numbers sequentially within a single CUDA thread (or sequentially within a single CPU thread):
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
__global__
void mykernel(float* array, int arrayLength)
{
int threadID = blockIdx.x * blockDim.x + threadIdx.x;
// each thread sorts array
// XXX note this causes a data race
thrust::sort(thrust::seq, array, array + arrayLength);
}
Note that your example causes a data race because each CUDA thread attempts to sort the same data in parallel. A correct race-free program would partition array according to thread index.
The thrust::seq execution policy, which is required for this feature, is only available in Thrust v1.8 or later.
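As a rough sketch of the partitioning mentioned above, each thread can sort its own contiguous chunk. The kernel name sortChunks and the chunkSize parameter are hypothetical names chosen for this example. Note that the result is a set of independently sorted chunks, not one globally sorted array.

```cuda
#include <thrust/execution_policy.h>
#include <thrust/sort.h>
#include <cuda_runtime.h>
#include <cstdio>

// Each thread sequentially sorts a disjoint chunk, so no two threads
// touch the same elements and there is no data race.
__global__ void sortChunks(float *array, int arrayLength, int chunkSize)
{
    int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    int begin = threadID * chunkSize;
    if (begin >= arrayLength) return;              // no chunk for this thread
    int end = min(begin + chunkSize, arrayLength); // last chunk may be short
    thrust::sort(thrust::seq, array + begin, array + end);
}

int main()
{
    const int arrayLength = 10;
    const int chunkSize = 4;   // illustrative value
    float h[arrayLength] = {9, 3, 7, 1, 8, 2, 6, 4, 5, 0};

    float *d = nullptr;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

    int numChunks = (arrayLength + chunkSize - 1) / chunkSize;
    sortChunks<<<1, numChunks>>>(d, arrayLength, chunkSize);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    for (int i = 0; i < arrayLength; ++i) std::printf("%g ", h[i]);
    std::printf("\n");

    cudaFree(d);
    return 0;
}
```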
@aland already referred you to an earlier answer about calling Thrust's parallel algorithms on the GPU. In that case the asker was simply trying to sort data which was already on the GPU; Thrust called from the CPU can handle GPU-resident data if you wrap the raw device pointers (for example in thrust::device_ptr).
Assuming your question is different and you really want to call a parallel sort in the middle of your kernel (as opposed to breaking the kernel into multiple smaller kernels and calling sort in between), then you should consider CUB, which provides a variety of primitives suitable for your purposes.
Update: Also see @Jared's answer, in which he explains that you can call Thrust's sequential algorithms on the GPU as of Thrust 1.7.

std::vector to array in CUDA

Is there a way to convert a 2D vector into an array to be able to use it in CUDA kernels?
It is declared as:
vector<vector<int>> information;
I want to cudaMalloc and copy from host to device, what would be the best way to do it?
int *d_information;
cudaMalloc((void**)&d_information, sizeof(int)*size);
cudaMemcpy(d_information, information, sizeof(int)*size, cudaMemcpyHostToDevice);
In a word, no, there isn't. The CUDA API doesn't support deep copying and doesn't know anything about std::vector. If you insist on having a vector of vectors as a host source, it will require doing something like this:
// size must be the total number of ints across all inner vectors
int *d_information;
cudaMalloc((void**)&d_information, sizeof(int)*size);
int *dst = d_information;
for (std::vector<std::vector<int> >::iterator it = information.begin(); it != information.end(); ++it) {
    size_t sz = it->size();
    if (sz == 0) continue;   // (*it)[0] is invalid for an empty row
    int *src = &((*it)[0]);
    // one copy per inner vector, packed back to back on the device
    cudaMemcpy(dst, src, sizeof(int)*sz, cudaMemcpyHostToDevice);
    dst += sz;
}
[disclaimer: written in browser, not compiled or tested. Use at own risk]
This would copy the host memory to an allocation in GPU linear memory, requiring one copy for each vector. If the vector of vectors is a "jagged" array, you will want to store an indexing somewhere for the GPU to use as well.
As far as I understand, the inner vectors of a vector of vectors do not need to reside in contiguous memory, i.e. they can be fragmented.
Depending on the amount of memory you need to transfer, I would do one of two things:
Reorder your memory into a single vector, and then use a single cudaMemcpy.
Issue a series of cudaMemcpyAsync calls, where each copy handles a single vector in your vector of vectors, and then synchronize.
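The first option could be sketched on the host as follows. FlatRows, flatten and at are hypothetical names made up for this example; the offsets array plays the role of the indexing a jagged array needs on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container: all rows packed back to back, plus row offsets.
struct FlatRows {
    std::vector<int> values;           // concatenated row data
    std::vector<std::size_t> offsets;  // row r spans [offsets[r], offsets[r + 1])
};

// Pack a (possibly jagged) vector of vectors into one contiguous buffer.
FlatRows flatten(const std::vector<std::vector<int>>& rows)
{
    FlatRows flat;
    std::size_t total = 0;
    for (const auto& row : rows) total += row.size();
    flat.values.reserve(total);
    flat.offsets.reserve(rows.size() + 1);
    flat.offsets.push_back(0);
    for (const auto& row : rows) {
        flat.values.insert(flat.values.end(), row.begin(), row.end());
        flat.offsets.push_back(flat.values.size());
    }
    return flat;
}

// Look up element (r, c) of the original layout in the flat buffer.
int at(const FlatRows& flat, std::size_t r, std::size_t c)
{
    assert(r + 1 < flat.offsets.size());
    assert(flat.offsets[r] + c < flat.offsets[r + 1]);
    return flat.values[flat.offsets[r] + c];
}
```

After flattening, flat.values.data() and flat.values.size() * sizeof(int) can be handed to a single cudaMemcpy, and flat.offsets can be copied the same way so that a kernel can locate each row.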

Passing a struct pointer to a CUDA kernel [duplicate]

Possible Duplicate:
Copying a struct containing pointers to CUDA device
I have a structure of device pointers, pointing to arrays allocated on the device.
like this
struct mystruct{
int* dev1;
double* dev2;
.
.
}
There are a large number of arrays in this structure. I started writing a CUDA kernel to which
I passed the pointer to mystruct and then dereferenced it within the
CUDA kernel code like this: mystruct->dev1[i].
But I realized after writing a few lines that this will not work, since by CUDA first principles
you cannot dereference a host pointer (in this case to mystruct) within a CUDA kernel.
But this is kind of inconvenient, since I will have to pass a large number of arguments
to my kernels. Is there any way to avoid this? I would like to keep the number of arguments
to my kernel calls as small as possible.
As I explain in this answer, you can pass your struct by value to the kernel, so you don't have to worry about dereferencing a host pointer:
__global__ void kernel(mystruct in)
{
int idx = threadIdx.x + blockIdx.x * blockDim.x;
in.dev1[idx] *= 2;
in.dev2[idx] += 3.14159;
}
There is the overhead of passing the struct by value to be aware of. However if your struct is not too large, it shouldn't matter.
If you pass the same struct to a lot of kernels, or repeatedly, you may consider copying the struct itself to global or constant memory instead as suggested by aland, or use mapped host memory as suggested by Mark Ebersole. But passing the struct by value is a much simpler way to get started.
(Note: please search StackOverflow before duplicating questions...)
You can copy your mystruct structure to global memory and pass its device address to the kernel.
From a performance viewpoint, however, it would be better to store mystruct in constant memory, since (I guess) there are a lot of random reads from it by many threads.
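A minimal sketch of the global-memory variant, assuming a two-member version of mystruct and omitting error checks: the struct is filled in on the host with device pointers, then the struct itself is copied to the device, so that the pointer passed to the kernel is a valid device address.

```cuda
#include <cuda_runtime.h>

// Reduced version of the struct from the question (two members only).
struct mystruct {
    int *dev1;
    double *dev2;
};

__global__ void kernel(const mystruct *s, int n)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) {
        s->dev1[idx] *= 2;        // s is a device pointer, so this is legal
        s->dev2[idx] += 3.14159;
    }
}

int main()
{
    const int n = 256;

    // Host-side struct whose members are device pointers.
    mystruct h_s;
    cudaMalloc(&h_s.dev1, n * sizeof(int));
    cudaMalloc(&h_s.dev2, n * sizeof(double));
    cudaMemset(h_s.dev1, 0, n * sizeof(int));
    cudaMemset(h_s.dev2, 0, n * sizeof(double));

    // Copy the struct itself into global memory.
    mystruct *d_s = nullptr;
    cudaMalloc(&d_s, sizeof(mystruct));
    cudaMemcpy(d_s, &h_s, sizeof(mystruct), cudaMemcpyHostToDevice);

    kernel<<<(n + 127) / 128, 128>>>(d_s, n);
    cudaDeviceSynchronize();

    cudaFree(d_s);
    cudaFree(h_s.dev2);
    cudaFree(h_s.dev1);
    return 0;
}
```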
You could also use page-locked (pinned) host memory and create the structure within that region if your setup supports it. Please see 3.2.4 of the CUDA programming guide.

Variable Sizes Array in CUDA

Is there any way to declare an array such as:
int arraySize = 10;
int array[arraySize];
inside a CUDA kernel/function? I read in another post that I could declare the size of the shared memory in the kernel call and then I would be able to do:
int array[];
But I cannot do this. I get a compile error: "incomplete type is not allowed". On a side note, I've also read that printf() can be called from within a thread and this also throws an error: "Cannot call host function from inside device/global function".
Is there anything I can do to make a variable sized array or equivalent inside CUDA? I am at compute capability 1.1, does this have anything to do with it? Can I get around the variable size array declarations from within a thread by defining a typedef struct which has a size variable I can set? Solutions for compute capabilities besides 1.1 are welcome. This is for a class team project and if there is at least some way to do it I can at least present that information.
About the printf, the problem is that it only works for compute capability 2.x and higher. There is an alternative, cuPrintf, that you might try.
For the allocation of variable-size arrays in CUDA, you do it like this:
Inside the kernel, you declare the array as extern __shared__ int array[];
On the kernel call, you pass the shared memory size in bytes as the third launch parameter, like mykernel<<<gridsize, blocksize, sharedmemsize>>>();
This is explained in the CUDA C programming guide in section B.2.3 about the __shared__ qualifier.
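Put together, the two steps might look like the sketch below. The kernel reverseInBlock and its arguments are invented for illustration; it uses a single block and assumes n does not exceed the maximum block size.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

// Reverses n ints within one block, using a shared array whose size
// is supplied at launch time rather than at compile time.
__global__ void reverseInBlock(int *data, int n)
{
    extern __shared__ int scratch[];   // sized by the third launch parameter
    int t = threadIdx.x;
    if (t < n) scratch[t] = data[t];
    __syncthreads();                   // all loads finish before any store
    if (t < n) data[t] = scratch[n - 1 - t];
}

int main()
{
    int n = 10;                        // only known at run time
    std::vector<int> h(n);
    for (int i = 0; i < n; ++i) h[i] = i;

    int *d = nullptr;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    // Third launch parameter: dynamic shared memory size in bytes.
    reverseInBlock<<<1, n, n * sizeof(int)>>>(d, n);

    cudaMemcpy(h.data(), d, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) std::printf("%d ", h[i]);
    std::printf("\n");

    cudaFree(d);
    return 0;
}
```

The array is shared by all threads of a block, so a per-thread variable-length array still has to be carved out of it by indexing.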
If your arrays can be large, one solution would be to have one kernel that computes the required array sizes and stores them in an array. After that kernel finishes, the host allocates the necessary arrays and passes an array of pointers to the threads, and then you run your computation as a second kernel.
Whether this helps depends on what you have to do, because the arrays would be allocated in global memory. If the total size (per block) of your arrays is less than the size of the available shared memory, you could have a sufficiently large shared memory array and let your threads negotiate splitting it amongst themselves.