Optimization for parallel reduction on multi-dimensional vector? [duplicate] - cuda

This question already has answers here:
Summing the rows of a matrix (stored in either row-major or column-major order) in CUDA
(1 answer)
How to perform reduction on a huge 2D matrix along the row direction using cuda? (max value and max value's index for each row)
(1 answer)
CUDA - Parallel Reduction over one axis
(1 answer)
CUDA C sum 1 dimension of 2D array and return
(1 answer)
Reduce matrix rows with CUDA
(3 answers)
Closed last year.
I am looking into potential ways to optimize my kernel code.
Mark Harris's blog gives a good example for a 1-D vector. How can I parallelize the code for multi-dimensional data?
For example, I have two rows of data and I want the average value of each row.
This pseudocode describes what I want to do:
tensor res({data.size[0]});
for (int i = 0; i < data.size[0]; i++) {
    float tmp = 0.0;
    for (int j = 0; j < data.size[1]; j++) {
        // accumulated sum for the i-th row
        tmp += data.at(i, j);
    }
    // average value for the i-th row
    res.at(i) = tmp / float(data.size[1]);
}
For the inner loop, I can easily adapt those methods to parallelize the execution.
Is there any suggestion for optimizing the outer loop, so that I can parallelize the computation across multiple rows as well?
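A common pattern for this (a minimal sketch, not from the original thread, assuming row-major float data and a power-of-two block size) is to assign one block per row and run the shared-memory tree reduction from Harris's slides inside each block:

__global__ void row_mean(const float *data, float *res, int rows, int cols)
{
    extern __shared__ float sdata[];

    int row = blockIdx.x;      // one block per row
    int tid = threadIdx.x;

    // Each thread accumulates a strided partial sum over this row's columns.
    float sum = 0.0f;
    for (int j = tid; j < cols; j += blockDim.x)
        sum += data[row * cols + j];
    sdata[tid] = sum;
    __syncthreads();

    // Shared-memory tree reduction (blockDim.x must be a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        res[row] = sdata[0] / (float)cols;
}

// Launch with one block per row, e.g.:
// row_mean<<<rows, 256, 256 * sizeof(float)>>>(d_data, d_res, rows, cols);

With one block per row the outer loop is parallelized across blocks for free; for very wide rows you could instead use several blocks per row and add a second reduction pass.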

Related

How to copy a small part of an array from global memory into the local memory of a thread

I have an n x n matrix A of floats (stored row by row in an array), where n is divisible by 4. I now want to output a matrix B of size (n/4) x (n/4). In B I want to store the arithmetic mean of all the values of each 4 by 4 submatrix in A. The submatrices do not overlap. E.g. if A were 16x16, B would be 4x4, since A has 16 such submatrices.
The thing is that I want to use 1 thread per submatrix and I want the memory access to be as coalesced as possible. My idea was to create a kernel in which thread 0 reads the values A[0], A[1], A[2], A[3] and thread 1 reads A[4], A[5], A[6], A[7]. The problem is that if I execute these reads one after another (this code only works for the first row of the matrix), e.g.:
float firstRow[4];
for (int i = 0; i < 4; i++)
    firstRow[i] = A[threadIdx.x * 4 + i];
the reads would not be coalesced. I would need something like cudaMemcpy, but from global memory into a thread's local memory.
I am not sure whether I could copy the whole array into shared memory, since it is 4 times bigger than the number of threads. Nevertheless, I want to know the answer to my original question.
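One possible approach (a sketch only, assuming A is row-major and 16-byte aligned so it can be reinterpreted as float4) is to let each thread fetch the four consecutive floats of a submatrix row with a single vectorized load; neighbouring threads then read neighbouring 16-byte segments of the same matrix row, which keeps the accesses reasonably coalesced:

// One thread averages one 4x4 submatrix of a row-major n x n matrix A.
__global__ void avg4x4(const float *A, float *B, int n)
{
    int nb = n / 4;                               // submatrices per row/column
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= nb * nb) return;

    int brow = idx / nb;                          // submatrix row index
    int bcol = idx % nb;                          // submatrix column index

    float sum = 0.0f;
    for (int r = 0; r < 4; r++) {
        // One 16-byte load fetches the 4 floats of this submatrix row.
        const float4 *rowPtr = reinterpret_cast<const float4 *>(A + (brow * 4 + r) * n);
        float4 v = rowPtr[bcol];
        sum += v.x + v.y + v.z + v.w;
    }
    B[idx] = sum / 16.0f;
}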

Uniformly distributed pseudorandom integers inside CUDA kernel [duplicate]

This question already has answers here:
Generating random number within Cuda kernel in a varying range
(3 answers)
Closed 8 years ago.
How can I generate uniformly distributed pseudorandom integers within a kernel? As far as I know, the cuRAND API offers a Poisson discrete distribution, but not a uniform integer one.
I suggest two options within a kernel:
1) use curand_uniform to obtain a random floating-point number from a uniform distribution, then map it to an integer interval:
float randu_f = curand_uniform(&localState);
randu_f *= (B - A + 0.999999);  // do not use (B - A + 1); see the note below
randu_f += A;
int randu_int = __float2int_rz(randu_f);
__float2int_rz converts a single-precision floating-point value to a signed integer in round-towards-zero mode.
Note: curand_uniform returns a sequence of pseudorandom floats uniformly distributed between 0.0 and 1.0; the result can include 1.0 but excludes 0.0.
That is why you should scale by something slightly less than (B - A + 1): there is a small chance of drawing exactly 1.0, and with a full (B - A + 1) factor the result would land out of bounds. I have not checked whether 0.999999 (or the largest float below 1.0), combined with the GPU's floating-point rounding, is guaranteed never to exceed the bounds.
2) call curand directly, which returns a sequence of pseudorandom unsigned integers, and reduce it with a modulo:
int randu_int = A + curand(&localState) % (B - A + 1);  // + 1 makes the range [A, B] inclusive
However, the modulo operation is expensive on the GPU, so method 1 is faster.
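Putting option 1 together into a self-contained kernel could look like the sketch below. The bounds A and B (inclusive) and the seed are illustrative assumptions; in real code you would normally initialize the curandState objects once in a separate setup kernel and reuse them, rather than calling curand_init on every launch.

#include <curand_kernel.h>

__global__ void rand_int_kernel(int *out, int n, int A, int B, unsigned long long seed)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    curandState localState;
    curand_init(seed, idx, 0, &localState);       // one subsequence per thread

    float randu_f = curand_uniform(&localState);  // uniform in (0.0, 1.0]
    randu_f *= (B - A + 0.999999f);               // scale to the integer range
    randu_f += A;
    out[idx] = __float2int_rz(randu_f);           // truncate toward zero -> [A, B]
}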

How do I calculate variance of gpu_array?

I am trying to compute the variance of a 2D gpu_array. A reduction kernel sounds like a good idea:
http://documen.tician.de/pycuda/array.html
However, this documentation implies that reduction kernels just reduce 2 arrays into 1 array. How do I reduce a single 2D array into a single value?
I guess the first step is to define variance for this case. In MATLAB, the variance function applied to a 2D array returns a vector (1D array) of values. But it sounds like you want a single-valued variance, so, as others have already suggested, the first thing to do is probably to treat the 2D array as 1D. In C we don't require any special steps to accomplish this: if you have a pointer to the array, you can index into it as if it were a 1D array. I'm assuming you don't need help on how to handle a 2D array with a 1D index.
Now if it's the 1D variance you're after, I'm assuming a function like variance(x) = sum((x[i] - mean(x))^2) / N, where the sum runs over all i and N is the number of elements, is what you want (based on my read of the Wikipedia article). We can break this down into 3 steps:
compute the mean (this is a classical reduction - one value is produced for the data set - sum all elements then divide by the number of elements)
compute the value (x[i]-mean)^2 for all i - this is an element by element operation producing an output data set equal in size (number of elements) to the input data set
compute the sum of the elements produced in step 2 - this is another classical reduction, as one value is produced for the entire data set.
Both steps 1 and 3 are classical reductions that sum all the elements of an array. Rather than cover that ground here, I'll point you to Mark Harris's excellent treatment of the topic as well as some CUDA sample code. For step 2, I'll bet you could figure out the kernel code on your own, but it would look something like this:
#include <math.h>

// one thread per element: output[idx] = (input[idx] - mean)^2
__global__ void var(float *input, float *output, unsigned N, float mean){
    unsigned idx = threadIdx.x + (blockDim.x * blockIdx.x);
    if (idx < N) output[idx] = __powf(input[idx] - mean, 2);
}
Note that you will probably want to combine the reductions and the above code into a single kernel.
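For comparison, here is a host-side sketch of the same three steps using Thrust (my assumption; the original question is about PyCUDA, but the structure is the same). Step 1 is a plain reduce, and steps 2 and 3 can be fused into a single transform_reduce:

#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/reduce.h>
#include <thrust/transform_reduce.h>

struct squared_diff
{
    float mean;
    squared_diff(float m) : mean(m) {}
    __host__ __device__ float operator()(float x) const
    {
        float d = x - mean;
        return d * d;                // step 2: (x[i] - mean)^2
    }
};

float variance(const thrust::device_vector<float> &d_x)
{
    float n = (float)d_x.size();

    // Step 1: mean (sum reduction, then divide by the element count).
    float mean = thrust::reduce(d_x.begin(), d_x.end(), 0.0f) / n;

    // Steps 2 and 3 fused: square the differences and sum them in one pass.
    float ss = thrust::transform_reduce(d_x.begin(), d_x.end(),
                                        squared_diff(mean), 0.0f,
                                        thrust::plus<float>());
    return ss / n;                   // population variance
}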

What is an effective way to filter numbers according to certain number-ranges using CUDA?

I have a lot of random floating-point numbers residing in global GPU memory. I also have "buckets" that each specify a range of numbers they will accept and a capacity for how many numbers they will accept.
i.e.:
numbers: -2 0 2 4
buckets(size=1): [-2, 0], [1, 5]
I want to run a filtration process that yields me
filtered_nums: -2 2
(where filtered_nums can be a new block of memory)
But every approach I take runs into a huge overhead of trying to synchronize threads across bucket counters. If I try to use a single-thread, the algorithm completes successfully, but takes frighteningly long (over 100 times slower than generating the numbers in the first place).
What I am asking for is a general high-level, efficient, as-simple-as-possible approach algorithm that you would use to filter these numbers.
edit
I will be dealing with 10 buckets and half a million numbers, where all the numbers fall into exactly one of the 10 bucket ranges. Each bucket will hold 43000 elements. (There are excess elements, since the objective is to fill every bucket, and many numbers will be discarded.)
2nd edit
It's important to point out that the buckets do not have to be stored individually. The objective is just to discard elements that would not fit into a bucket.
You can use thrust::copy_if (note that thrust::remove_copy_if with the same predicate would do the opposite and keep the out-of-range elements):
struct within_limit
{
    __host__ __device__
    bool operator()(const float x)
    {
        return (x >= lo && x < hi);
    }
};

thrust::copy_if(input, input + N, result, within_limit());
You will have to replace lo and hi with constants for each bin.
I think you can templatize the functor, but then again you will have to instantiate the template with actual constants. I can't see an easy way around it, but I may be missing something.
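One way around hard-coding the constants (my suggestion, not part of the original answer) is to store lo and hi as members of the functor and set them at construction time, so the same type serves every bin:

struct within_limit
{
    float lo, hi;
    within_limit(float l, float h) : lo(l), hi(h) {}

    __host__ __device__
    bool operator()(const float x) const
    {
        return (x >= lo && x < hi);
    }
};

// keep the elements that fall inside [lo[i], hi[i]):
// thrust::copy_if(input, input + N, result, within_limit(lo[i], hi[i]));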
If you are willing to look at third party libraries, arrayfire may offer an easier solution.
array I = array(N, input, afDevice);
float **Res = (float **)malloc(sizeof(float *) * nbins);
for (int i = 0; i < nbins; i++) {
    array res = where(I >= lo[i] && I < hi[i]);
    Res[i] = res.device<float>();
}

CUDA: Summation of results

I'm using CUDA to run a problem where I need to evaluate a complex equation over many input matrices. Each matrix has an ID depending on its set (between 1 and 30; there are 100,000 matrices), and the result for each matrix is stored in a float[N] array, where N is the number of input matrices.
After this, the result I want is the sum of every float in this array for each ID, so with 30 IDs there are 30 result floats.
Any suggestions on how I should do this?
Right now, I read the float array (400 KB) back from the device to the host and run this on the host:
// Allocate result_array for 100,000 floats on the device
// CUDA process input matrices
// Read from the device back to the host into result_array
float result[30] = { 0 };   // one accumulator per ID
for (int i = 0; i < N; i++)
{
    result[input[i].ID - 1] += result_array[i];   // IDs run from 1 to 30
}
But I'm wondering if there's a better way.
You could use cublasSasum() to do this - this is a bit easier than adapting one of the SDK reductions (but less general of course). Check out the CUBLAS examples in the CUDA SDK.
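If you would rather keep the per-ID summation on the device, one simple alternative (my sketch, not part of the original answer) is a kernel that atomically accumulates into a 30-element result array. It assumes a separate int array holding each matrix's 1-based ID and a GPU with float atomicAdd (compute capability 2.0 or later):

__global__ void sum_by_id(const float *result_array, const int *ids,
                          float *results, int n)
{
    // results must hold 30 floats and be zeroed before the launch
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&results[ids[i] - 1], result_array[i]);  // IDs run from 1 to 30
}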