I am trying to compute the variance of a 2D GPUArray. A reduction kernel sounds like a good idea:
http://documen.tician.de/pycuda/array.html
However, the documentation implies that a reduction kernel just reduces two arrays into one array. How do I reduce a single 2D array into a single value?
I guess the first step is to define variance for this case. In MATLAB, the variance function applied to a 2D array returns a vector (1D array) of values. But it sounds like you want a single-valued variance, so, as others have already suggested, the first thing to do is probably to treat the 2D array as 1D. In C we won't require any special steps to accomplish this: if you have a pointer to the array, you can index into it as if it were a 1D array (e.g. element (row, col) is x[row*ncols + col]). I'm assuming you don't need help on how to handle a 2D array with a 1D index.
Now if it's the 1D variance you're after, I'm assuming a function like variance(x) = sum((x[i]-mean(x))^2)/N, where the sum is over all i and N is the number of elements, is what you want (based on my read of the wikipedia article). We can break this down into 3 steps:
compute the mean (this is a classical reduction - one value is produced for the data set - sum all elements, then divide by the number of elements)
compute the value (x[i]-mean)^2 for all i - this is an element-by-element operation producing an output data set equal in size (number of elements) to the input data set
compute the sum of the elements produced in step 2 and divide by N - this is another classical reduction, as one value is produced for the entire data set.
Both steps 1 and 3 are classical reductions which sum all elements of an array. Rather than cover that ground here, I'll point you to Mark Harris' excellent treatment of the topic as well as the CUDA sample code. For step 2, I'll bet you could figure out the kernel code on your own, but it would look something like this:
// Step 2: compute (x[i]-mean)^2 element by element
__global__ void var(float *input, float *output, unsigned N, float mean){
    unsigned idx = threadIdx.x + (blockDim.x * blockIdx.x);
    // one thread per element; guard against out-of-range threads
    if (idx < N) output[idx] = __powf(input[idx] - mean, 2.0f);
}
Note that you will probably want to combine the reductions and the above code into a single kernel.
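For reference, here is a minimal sketch of the classical sum reduction used in steps 1 and 3 (the kernel name sum_reduce and the partial-sums buffer block_sums are illustrative, not from the CUDA samples; it assumes blockDim.x is a power of 2, and Mark Harris' material covers much faster variants):
__global__ void sum_reduce(const float *input, float *block_sums, unsigned N){
    extern __shared__ float sdata[];            // blockDim.x floats, sized at launch
    unsigned tid = threadIdx.x;
    unsigned idx = threadIdx.x + (blockDim.x * blockIdx.x);
    sdata[tid] = (idx < N) ? input[idx] : 0.0f; // load, zero-padding out-of-range threads
    __syncthreads();
    // tree reduction in shared memory
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1){
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) block_sums[blockIdx.x] = sdata[0]; // one partial sum per block
}
A launch looks like sum_reduce<<<blocks, threads, threads*sizeof(float)>>>(input, block_sums, N); the per-block partial sums then need one more (small) reduction pass.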
Based on this post, I can apply a 1D FFT on each row and then on each column of a 2D vector.
If I have a 1D vector, I can view it as 2D by addressing cells like this: v[rowIndex * columnCount + columnIndex].
My FFT1D algorithm pads a vector with 0's up to the next power-of-2 length.
So if I use it in this case on my 1D vector viewed as 2D, won't it add 0's where it is not supposed to, in between values, thus messing up my result of the FFT2D?
The actual algorithm (I am quoting from the post you referenced in your question) is:
1D FFT on each row (real to complex)
1D FFT on each column resulting from (1) (complex to complex)
The zero-padding can cause problems only if you modify the original 2D matrix while you are working on it, which doesn't appear to be necessary: pad a copy of each row instead and leave the original data untouched.
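As an illustration, here is a minimal host-side sketch (all the names, v, padded, rows, cols and paddedCols, are assumptions for illustration): each row is copied into its own slot of a separate zero-padded buffer before the row-wise FFTs, so the zeros land only at the end of each padded row, never in between values:
#include <string.h>

/* v:      the original rows x cols matrix, flattened as v[r * cols + c]    */
/* padded: a rows x paddedCols buffer, paddedCols = next power of 2 >= cols */
static void pad_rows(const float *v, float *padded, int rows, int cols, int paddedCols)
{
    for (int r = 0; r < rows; ++r) {
        memcpy(&padded[r * paddedCols], &v[r * cols], cols * sizeof(float));
        /* zero only the padding tail of this row */
        memset(&padded[r * paddedCols + cols], 0,
               (paddedCols - cols) * sizeof(float));
    }
}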
Consider a cuDoubleComplex array a in device memory. Is it possible to get pointers to the real and imaginary parts of a without allocating and doing a deep copy into two new double arrays?
Something like this:
real_a = //points to real part of a
imag_a = //points to imaginary part of a
instead of something like:
/* allocate real_a and imag_a here */
for (int j = 0; j < numElements; j++){
    real_a[j] = a[j].x;  // deep-copy the real parts
    imag_a[j] = a[j].y;  // deep-copy the imaginary parts
}
CUDA does have something like this for individual numbers (cuCreal() and cuCimag() in cuComplex.h), but not for arrays/pointers.
The reason is that I would like to be able to call cuBLAS D rather than Z functions on the real and imaginary parts separately. For example,
cublasDgemm(...,real_a,...,somearray,...,anotherarray,...)
Is it possible to get pointers to the real and imaginary parts of a without allocating and doing a deep copy into two new double arrays?
That can be done:
double* real_a = reinterpret_cast<double*>(&a[0].x); //points to real part of a
double* imag_a = reinterpret_cast<double*>(&a[0].y); //points to imaginary part of a
but note that you need to use a stride of 2 when accessing these pointers to get the correct real or imaginary elements, i.e. real_a[2*j] is the real part of a[j].
The reason is that I would like to be able to call cuBLAS D rather than Z functions on the real and imaginary parts separately.
This will work with BLAS functions which operate on your real or imaginary pointers as vectors, because those BLAS routines allow a stride to be passed (which must be two in this case).
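A minimal sketch with such a vector routine (handle and n are assumptions here: an initialized cuBLAS handle and the number of complex elements; cublasDasum sums the absolute values of the elements it visits):
double result;
double* real_a = reinterpret_cast<double*>(&a[0].x); // real parts, interleaved
// incx = 2 steps over the interleaved imaginary parts,
// so only the real elements are summed
cublasDasum(handle, (int)n, real_a, 2, &result);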
As for your example:
cublasDgemm(...,real_a,...,somearray,...,anotherarray,...)
That won't work with the pointers you can directly get as I have shown here. BLAS functions which treat the array as a matrix do support strided source and destination data, but that stride (the leading-dimension parameter) is applied to the start of each column within the flattened matrix, not to elements within a column, which is what you would need to make this work correctly.
I need to compute the mean of a 2D array using CUDA, but I don't know how to proceed. I started by doing a column reduction; after that I will sum the resulting array, and in the last step I will compute the mean.
To do this, do I need to do the whole work on the device at once, or should I do it step by step, with each step needing a round trip between the CPU and the GPU?
If it's the simple arithmetic mean of all the elements in the 2D array, you can use thrust:
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>

int* data;
int num;
get_data_from_library( &data, &num );

// construction copies the host data to the device
thrust::device_vector<int> iVec(data, data + num);
// compute the sum on the device
int sum = thrust::reduce(iVec.begin(), iVec.end(), 0, thrust::plus<int>());
double mean = sum / (double)num;
If you want to write your own kernel, keep in mind that a 2D array is essentially a 1D array divided into row-sized chunks, and go through the 'parallel reduction' example in the CUDA SDK: Whitepaper
If I use
float sum = thrust::transform_reduce(d_a.begin(), d_a.end(), conditional_operator(), 0.f, thrust::plus<float>());
I get the sum of all elements meeting a condition provided by conditional_operator(), as in Conditional reduction in CUDA.
But how can I sum only the elements d_a[0], d_a[2], d_a[4], d_a[6], ...?
I thought of changing the conditional operator, but it works on the elements of the array without any reference to their indices.
What can I do about that?
There are two approaches I can think of for solving this sort of problem:
Use the thrust zip operator to combine a counting iterator with the input data and modify your existing functor to accept tuples of (index, data). You can have the functor return the data when the index matches your criteria, and zero otherwise. This will work correctly with scan and reduction algorithms (a minimal sketch follows below).
Use a thrust permutation iterator to gather the data which you want to sum and pass it to the standard reduce algorithm. The thrust developers have an example strided iterator which you can use to solve the problem of only processing every nth entry in an input iterator.
It might be worth implementing both and benchmarking them to see which approach is faster.
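Here is a minimal sketch of the first (zip iterator) approach; the functor name even_index_value is illustrative, and d_a is assumed to be a thrust::device_vector<float>:
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <thrust/tuple.h>

// return the value for even indices, zero otherwise
struct even_index_value
{
    __host__ __device__
    float operator()(const thrust::tuple<int, float>& t) const
    {
        return (thrust::get<0>(t) % 2 == 0) ? thrust::get<1>(t) : 0.0f;
    }
};

float sum = thrust::transform_reduce(
    thrust::make_zip_iterator(thrust::make_tuple(
        thrust::counting_iterator<int>(0), d_a.begin())),
    thrust::make_zip_iterator(thrust::make_tuple(
        thrust::counting_iterator<int>(d_a.size()), d_a.end())),
    even_index_value(), 0.0f, thrust::plus<float>());
The counting iterator generates the indices on the fly, so no index array has to be materialized in memory.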
If i is a random-walk index vector like the one below (the indices are not unique), and there is a device vector A filled with zeros,
{0, 1, 0, 2, 3, 3, ....}
is it possible for thrust to auto-increment A[i] for each occurrence of i, so that after the operation A may look like
// 2 is the number of times 0 appears
// 1 is the number of times 1 appears
// 1 is the number of times 2 appears
// 2 is the number of times 3 appears
{2, 1, 1, 2}
I have tried several approaches, but they only work fine when A is a host vector. I guess that because thrust runs the operation in parallel, one increment's result can't be seen by the others, so the result may look like
// each index is only counted once, no matter how many times it appears
{1, 1, 1, 1}
Can thrust achieve my goal with a device vector A and a random-walk index vector?
If you are seeking histogram calculations with Thrust, then you may wish to note that there is a Thrust documentation example providing two different algorithms:
Dense histogram, using sort to sort the array, then upper_bound to determine a cumulative histogram and finally adjacent_difference to calculate the histogram (a sketch of this approach is given at the end of this answer);
Sparse histogram, using sort to sort the array and then reduce_by_key, as mentioned by @Eric in his comment.
Judging from these two threads:
histogram or count_by_key;
How to obtain a histogram from a sorted sequence,
I would say that the above are the only two methods to achieve a histogram using Thrust. I have timed both approaches on a Kepler K20c card, and these were the timings:
N=1024*16; # bins = 16*16; Dense = 2.0ms; Sparse = 2.4ms;
N=1024*128; # bins = 16*128; Dense = 3.4ms; Sparse = 3.1ms;
Accounting for the fact that the timing depends on the input array, I would say that the results do not appear to be dramatically different.
It should be noted that the CUDA samples provide a histogram calculation example, but it is optimized for 64 or 256 bins, so it is not directly comparable to the above-mentioned Thrust codes.
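For reference, a minimal sketch of the dense algorithm along the lines of the Thrust documentation example (the function name dense_histogram is illustrative; the samples are assumed to be integers in [0, num_bins)):
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/binary_search.h>
#include <thrust/adjacent_difference.h>
#include <thrust/iterator/counting_iterator.h>

void dense_histogram(thrust::device_vector<int>& data,      // sorted in place
                     thrust::device_vector<int>& histogram,
                     int num_bins)
{
    thrust::sort(data.begin(), data.end());
    histogram.resize(num_bins);
    // cumulative histogram: for each bin b, the number of samples <= b
    thrust::upper_bound(data.begin(), data.end(),
                        thrust::counting_iterator<int>(0),
                        thrust::counting_iterator<int>(num_bins),
                        histogram.begin());
    // per-bin counts from the cumulative counts
    thrust::adjacent_difference(histogram.begin(), histogram.end(),
                                histogram.begin());
}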