I need to compute the mean of a 2D array using CUDA, but I don't know how to proceed. My plan is to start with a column reduction, then sum the resulting array, and finally compute the mean in a last step.
To do this, do I need to do the whole work on the device at once? Or do I do it step by step, with each step requiring a round trip between the CPU and the GPU?
If it's the simple arithmetic mean of all the elements in the 2D array, you can use Thrust:
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>

int* data;
int num;
get_data_from_library( &data, &num );

// transfer to device
thrust::device_vector<int> iVec(data, data + num);
// compute the sum, then the mean
int sum = thrust::reduce(iVec.begin(), iVec.end(), 0, thrust::plus<int>());
double mean = sum / (double)num;
If you want to write your own kernel, keep in mind that a 2D array is essentially a 1D array divided into row-sized chunks, and go through the SDK 'parallel reduction' example: Whitepaper
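To give a flavour of that approach, here is a minimal block-level sum-reduction sketch (my own illustration, not the SDK code; it assumes blockDim.x is a power of two). Each block writes one partial sum to block_sums; reduce those again (or finish on the host), then divide by the total element count to get the mean:

// One partial sum per block, using shared memory; launch with
// shared memory size = blockDim.x * sizeof(float).
__global__ void sum_reduce(const float *in, float *block_sums, unsigned n)
{
    extern __shared__ float sdata[];
    unsigned tid = threadIdx.x;
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // tree-based reduction within the block
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) block_sums[blockIdx.x] = sdata[0];
}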
I have a very long array on the device, and for a condition check I want to access (on the host/CPU) only one element from the middle of it (say the Nth element). What would be the optimal way of doing this?
Do I need to write a kernel that writes the Nth location of the source array into a single-element array, and then copy that single-element array to the host?
You can copy a single element of an array using cudaMemcpy.
Let's say you want to copy the N-th element of the array
int * dSourceArray
to the variable
int hTargetVariable
You can apply pointer arithmetic to device pointers on the host. All you need to do is advance the dSourceArray pointer by N elements and copy a single element:
cudaMemcpy(&hTargetVariable, dSourceArray + N, sizeof(int), cudaMemcpyDeviceToHost);
Keep in mind that if you use multiple streams you may want to synchronize the device before transferring the data.
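Putting the two together, a minimal sketch using the names above (the error handling is illustrative; it assumes the rest of the program reports the returned status):

// Synchronize first in case another stream is still writing to dSourceArray,
// then copy the N-th int straight into the host variable.
cudaDeviceSynchronize();
cudaError_t status = cudaMemcpy(&hTargetVariable, dSourceArray + N,
                                sizeof(int), cudaMemcpyDeviceToHost);
// status should be cudaSuccess; otherwise report cudaGetErrorString(status)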
One addendum to the answer above: you may need to take account of the bytes per element of your array, e.g. for an array of arrays of various types on the device:
#ifdef CUDA_KERNEL
char* mgpu[ MAX_BUF ];        // on device: array of pointers to arrays of various types
#else
CUdeviceptr mgpu[ MAX_BUF ];  // on host: each entry is a device pointer
CUdeviceptr gpu ( int n ) { return mgpu[n]; }
#endif

CUdeviceptr GPUpointer = m_Fluid.gpu(FGRIDOFF);   // device pointer to the FGRIDOFF (int) array
cuMemcpyDtoH( &CPUelement, GPUpointer + (offset * sizeof(int)), sizeof(int) );
I am trying to compute the variance of a 2D gpu_array. A reduction kernel sounds like a good idea:
http://documen.tician.de/pycuda/array.html
However, this documentation implies that reduction kernels just reduce 2 arrays into 1 array. How do I reduce a single 2D array into a single value?
I guess the first step is to define variance for this case. In MATLAB, the variance function on a 2D array returns a vector (1D array) of values. But it sounds like you want a single-valued variance, so as others have already suggested, the first thing to do is probably to treat the 2D array as 1D. In C we don't require any special steps to accomplish this: if you have a pointer to the array you can index into it as if it were a 1D array. I'm assuming you don't need help on how to handle a 2D array with a 1D index.
Now if it's the 1D variance you're after, I'm assuming a function like variance(x) = sum((x[i] - mean(x))^2) / N, where the sum is over all i, is what you want (based on my read of the Wikipedia article). We can break this down into 3 steps:
compute the mean (this is a classical reduction - one value is produced for the data set - sum all elements then divide by the number of elements)
compute the value (x[i]-mean)^2 for all i - this is an element by element operation producing an output data set equal in size (number of elements) to the input data set
compute the sum of the elements produced in step 2 - this is another classical reduction, as one value is produced for the entire data set.
Both steps 1 and 3 are classical reductions which are summing all elements of an array. Rather than cover that ground here, I'll point you to Mark Harris' excellent treatment of the topic as well as some CUDA sample code. For step 2, I'll bet you could figure out the kernel code on your own, but it would look something like this:
#include <math.h>

__global__ void var(float *input, float *output, unsigned N, float mean)
{
    unsigned idx = threadIdx.x + (blockDim.x * blockIdx.x);
    if (idx < N) output[idx] = __powf(input[idx] - mean, 2);
}
Note that you will probably want to combine the reductions and the above code into a single kernel.
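If you would rather lean on a library for steps 1-3, here is a rough Thrust sketch (my own illustration, not part of the answer above) that computes the mean with reduce and then fuses steps 2 and 3 into a single transform_reduce:

#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>

// squared deviation from a fixed mean (step 2)
struct squared_diff
{
    float mean;
    __host__ __device__ squared_diff(float m) : mean(m) {}
    __host__ __device__ float operator()(float x) const
    {
        float d = x - mean;
        return d * d;
    }
};

// population variance of the flattened array (steps 1-3)
float variance(const thrust::device_vector<float> &d_x)
{
    unsigned N = d_x.size();
    float mean = thrust::reduce(d_x.begin(), d_x.end(), 0.0f) / N;   // step 1
    float ssd  = thrust::transform_reduce(d_x.begin(), d_x.end(),    // steps 2 + 3
                                          squared_diff(mean),
                                          0.0f, thrust::plus<float>());
    return ssd / N;
}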
I am looking into how to copy a 2D array with a variable width for each row to the GPU.
int rows = 1000;
int cols;
int** host_matrix = (int**)malloc(sizeof(int*) * rows);
int **d_array;
int *length;
...
Each host_matrix[i] might have a different length, which I know from length[i], and that is where the problem starts. I would like to avoid copying dummy data. Is there a better way of doing it?
According to this thread, that won't be a clever way of doing it:
cudaMalloc((void **)&d_array, rows * sizeof(int*));
for(int i = 0 ; i < rows ; i++) {
    cudaMalloc((void **)&d_array[i], length[i] * sizeof(int));
}
But I cannot think of any other method. Is there a smarter way of doing it?
Can it be improved using cudaMallocPitch and cudaMemcpy2D?
The correct way to allocate an array of pointers for the GPU in CUDA is something like this:
int **hd_array, **d_array;
hd_array = (int **)malloc(nrows * sizeof(int*));
cudaMalloc((void **)&d_array, nrows * sizeof(int*));
for(int i = 0 ; i < nrows ; i++) {
    cudaMalloc((void **)&hd_array[i], length[i] * sizeof(int));
}
cudaMemcpy(d_array, hd_array, nrows * sizeof(int*), cudaMemcpyHostToDevice);
(disclaimer: written in browser, never compiled, never tested, use at own risk)
The idea is that you assemble a copy of the array of device pointers in host memory first, then copy that to the device. For your hypothetical case with 1000 rows, that means 1001 calls to cudaMalloc and then 1001 calls to cudaMemcpy just to set up the device memory allocations and copy data into the device. That is an enormous overhead penalty, and I would counsel against trying it; the performance will be truly terrible.
If you have very jagged data and need to store it on the device, might I suggest taking a cue from the mother of all jagged data problems - large, unstructured sparse matrices - and adopting one of the sparse matrix formats for your data instead. Using the classic compressed sparse row format as a model, you could do something like this:
int * data, * rows, * lengths;
cudaMalloc((void **)&rows, nrows * sizeof(int));
cudaMalloc((void **)&lengths, nrows * sizeof(int));
cudaMalloc((void **)&data, N * sizeof(int));
In this scheme, all the data is stored in a single, linear memory allocation, data. The i-th row of the jagged array starts at data[rows[i]] and has a length of lengths[i]. This means you only need three memory allocation and copy operations to transfer any amount of data to the device, rather than nrows in your current scheme, i.e. it reduces the overhead from O(N) to O(1).
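To make the scheme concrete, a hypothetical host-side packing step might look like the sketch below (h_rows, h_lengths, h_data and N are illustrative names; error checking omitted):

// Build row offsets and lengths, pack the jagged rows into one flat buffer.
int *h_rows    = (int *)malloc(nrows * sizeof(int));
int *h_lengths = (int *)malloc(nrows * sizeof(int));
int N = 0;
for (int i = 0; i < nrows; i++) {
    h_rows[i]    = N;            // starting offset of row i
    h_lengths[i] = length[i];
    N           += length[i];
}
int *h_data = (int *)malloc(N * sizeof(int));
for (int i = 0; i < nrows; i++)
    memcpy(h_data + h_rows[i], host_matrix[i], length[i] * sizeof(int));

// three copies move everything to the device
cudaMemcpy(rows,    h_rows,    nrows * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(lengths, h_lengths, nrows * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(data,    h_data,    N * sizeof(int),     cudaMemcpyHostToDevice);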
I would put all the data into one array. Then compose another array with the row lengths, so that A[0] is the length of row 0 and so on, i.e. A[i] = length[i].
Then you just need to allocate two arrays on the card and call cudaMemcpy twice.
Of course it's a little bit of extra work, but I think performance-wise it will be an improvement (depending, of course, on how you use the data on the card).
I need to perform a parallel reduction to find the min or max of an array on a CUDA device. I found a good library for this, called Thrust. It seems that you can only perform a parallel reduction on arrays in host memory. My data is in device memory. Is it possible to perform a reduction on data in device memory?
I can't figure out how to do this. Here is the documentation for Thrust: http://code.google.com/p/thrust/wiki/QuickStartGuide#Reductions. Thank you all.
You can do reductions in Thrust on arrays which are already in device memory. All you need to do is wrap your device pointers inside thrust::device_ptr containers and call one of the reduction procedures, just as shown in the wiki you have linked to:
#include <thrust/device_ptr.h>
#include <thrust/extrema.h>

// assume this is a valid device allocation holding N ints
int * dmem;
// Wrap raw device pointer
thrust::device_ptr<int> dptr(dmem);
// use max_element for reduction
thrust::device_ptr<int> dresptr = thrust::max_element(dptr, dptr+N);
// retrieve result from device (if required)
int max_value = dresptr[0];
Note that the return value is also a device_ptr, so you can use it directly in other kernels using thrust::raw_pointer_cast:
int * dres = thrust::raw_pointer_cast(dresptr);
If Thrust or any other library does not provide you with such a service, you can still write that kernel yourself.
Mark Harris has a great tutorial about parallel reduction and its optimisations on CUDA.
Following his slides, it is not that hard to implement it and modify it for your needs.
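As one possible variant (my own sketch, using warp shuffles rather than the shared-memory scheme in the slides), a block-level min reduction could look like this; max is analogous with fmaxf, and block_mins still needs a second pass or a host-side finish:

#include <float.h>

// Block-level min reduction using warp shuffles (one value per block);
// assumes blockDim.x is a multiple of 32 and at most 1024.
__global__ void min_reduce(const float *in, float *block_mins, unsigned n)
{
    __shared__ float warp_mins[32];
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : FLT_MAX;

    // reduce within each warp
    for (int offset = 16; offset > 0; offset >>= 1)
        v = fminf(v, __shfl_down_sync(0xffffffff, v, offset));

    // lane 0 of each warp stores its result
    if ((threadIdx.x & 31) == 0) warp_mins[threadIdx.x >> 5] = v;
    __syncthreads();

    // the first warp reduces the per-warp results
    if (threadIdx.x < 32) {
        unsigned nwarps = blockDim.x >> 5;
        v = (threadIdx.x < nwarps) ? warp_mins[threadIdx.x] : FLT_MAX;
        for (int offset = 16; offset > 0; offset >>= 1)
            v = fminf(v, __shfl_down_sync(0xffffffff, v, offset));
        if (threadIdx.x == 0) block_mins[blockIdx.x] = v;
    }
}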
I'm using CUDA to run a problem involving a complex equation with many input matrices. Each matrix has an ID depending on its set (the IDs run from 1 to 30, and there are 100,000 matrices), and the result for each matrix is stored in a float[N] array, where N is the number of input matrices.
After this, the result I want is the sum of every float in this array for each ID, so with 30 IDs there are 30 result floats.
Any suggestions on how I should do this?
Right now, I read the float array (400 KB) back to the host from the device and run this on the host:
// Allocate result_array for 100,000 floats on the device
// CUDA process input matrices
// Read from the device back to the host into result_array
float result[30] = { 0 };
for (int i = 0; i < N; i++)
{
    result[input[i].ID - 1] += result_array[i];   // IDs run from 1 to 30
}
But I'm wondering if there's a better way.
You could use cublasSasum() to do this - it is a bit easier than adapting one of the SDK reductions, but less general of course (note that cublasSasum() sums absolute values, so this only gives the plain sum if your results are non-negative). Check out the CUBLAS examples in the CUDA SDK.
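Alternatively, if the results stay on the device, a rough sketch using Thrust's sort_by_key and reduce_by_key (assuming you can also build a device array of the per-matrix IDs; the names here are illustrative) computes all 30 per-ID sums without copying the full array back:

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>

// d_ids[i] is the ID (1..30) of matrix i, d_vals[i] its result; both length N.
void per_id_sums(thrust::device_vector<int>& d_ids,
                 thrust::device_vector<float>& d_vals,
                 thrust::device_vector<int>& out_ids,      // size 30
                 thrust::device_vector<float>& out_sums)   // size 30
{
    // group equal IDs together, carrying the values along
    thrust::sort_by_key(d_ids.begin(), d_ids.end(), d_vals.begin());

    // sum each run of equal IDs; writes one (ID, sum) pair per distinct ID
    thrust::reduce_by_key(d_ids.begin(), d_ids.end(), d_vals.begin(),
                          out_ids.begin(), out_sums.begin());
}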