Confusion about row/column vectors in CUBLAS - thrust

I am just starting with CUBLAS/CUDA programming, mainly for matrix and vector operations. I am quite confused about the orientation of vectors in CUBLAS: there seems to be no difference between a row and a column vector. So if I use a level-2 function to multiply a matrix by a vector, how can I specify the vector's orientation? Will it always be treated as a column vector? If I want to multiply a column vector (nx1) by a row vector (1xm) to produce a matrix (nxm), should I treat both as matrices and use a level-3 function for the multiplication?
Also, I am using Thrust to generate the vectors. If I pass a thrust vector (n elements) to cublasCgemm as a 1xn or nx1 matrix (i.e. a row or column vector), will it be treated as 1xn or as nx1 when I set the cublasOperation_t to CUBLAS_OP_N?
Thanks.

All data are passed through plain pointers, i.e. double*, and stored sequentially in memory, so there is no difference between row and column vectors. Plain pointers are used for 2D arrays too; CUBLAS stores matrices in column-major order and gives you a simple macro to locate elements in a matrix:
#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))
where i is the row, j is the column and ld is the leading dimension of the matrix (note the 1-based indexing). ld matters when you want to operate on a submatrix of a full matrix.
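For instance, here is a minimal host-side sketch of the 1-based, column-major indexing the macro implements (the dimensions and fill values are just illustrative):
#include <stdio.h>
#include <stdlib.h>

#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))   /* as above */

int main(void)
{
    int M = 4, N = 3;   /* illustrative dimensions */
    double *A = (double *)malloc(M * N * sizeof(double));

    /* fill in column-major order: element (i,j) gets the value 10*i + j */
    for (int j = 1; j <= N; j++)
        for (int i = 1; i <= M; i++)
            A[IDX2F(i, j, M)] = 10 * i + j;

    printf("A(2,3) = %g\n", A[IDX2F(2, 3, M)]);       /* prints 23 */

    /* a submatrix is just a pointer offset; ld stays M, the leading
       dimension of the parent matrix */
    double *Asub = &A[IDX2F(2, 2, M)];
    printf("Asub(1,1) = %g\n", Asub[IDX2F(1, 1, M)]); /* prints 22 */

    free(A);
    return 0;
}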
The multiplication (nx1)(1xm)=(nxm) is performed by the cublasDger function:
cublasStatus_t cublasDger(cublasHandle_t handle, int m, int n,
                          const double *alpha,
                          const double *x, int incx,
                          const double *y, int incy,
                          double *A, int lda)
If, for example, y is a row of a column-major (kxm) matrix, then use incy=k.
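Putting it together, here is a minimal sketch of the (nx1)(1xm)=(nxm) outer product with cublasDger. Note that cublasDger actually computes the rank-1 update A = alpha*x*y^T + A, so zeroing A and using alpha = 1 gives the plain outer product (error checking omitted for brevity):
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    const int n = 3, m = 2;
    double x[] = {1, 2, 3};        /* column vector, n elements          */
    double y[] = {4, 5};           /* row vector, m elements             */
    double A[6] = {0};             /* n x m result, column-major, zeroed */

    double *d_x, *d_y, *d_A;
    cudaMalloc((void **)&d_x, n * sizeof(double));
    cudaMalloc((void **)&d_y, m * sizeof(double));
    cudaMalloc((void **)&d_A, n * m * sizeof(double));
    cudaMemcpy(d_x, x, n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, m * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemset(d_A, 0, n * m * sizeof(double));

    cublasHandle_t handle;
    cublasCreate(&handle);

    /* A = 1.0 * x * y^T + 0, i.e. the (nx1)(1xm) outer product */
    double alpha = 1.0;
    cublasDger(handle, n, m, &alpha, d_x, 1, d_y, 1, d_A, n);

    cudaMemcpy(A, d_A, n * m * sizeof(double), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++) {           /* print row by row */
        for (int j = 0; j < m; j++)
            printf("%6.1f ", A[i + j * n]); /* column-major access */
        printf("\n");
    }

    cublasDestroy(handle);
    cudaFree(d_x); cudaFree(d_y); cudaFree(d_A);
    return 0;
}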

Computing the mean of a 2D array in CUDA

I need to compute the mean of a 2D array using CUDA, but I don't know how to proceed. I started with a column reduction; after that I will sum the resulting array, and in the last step compute the mean.
To do this, do I need to do all the work on the device at once, or do I do it step by step, with each step needing a round trip between the CPU and the GPU?
If it's the simple arithmetic mean of all the elements in the 2D array, you can use Thrust:
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>

int* data;
int num;
get_data_from_library( &data, &num );

// copy the flat 2D array to the device
thrust::device_vector< int > iVec(data, data + num);
// sum all elements in a single reduction
int sum = thrust::reduce(iVec.begin(), iVec.end(), 0, thrust::plus<int>());
double mean = sum / (double)num;
If you want to write your own kernel, keep in mind that a 2D array is essentially a 1D array divided into row-sized chunks, and work through the SDK 'parallel reduction' example: Whitepaper
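For reference, a minimal sketch of the tree reduction that whitepaper describes. It produces one partial sum per block (the partial sums still need a final pass, and the mean is then the total divided by the element count), and it assumes blockDim.x is a power of two:
// launch as: block_sum<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_partial, n);
__global__ void block_sum(const int *in, int *partial, unsigned n)
{
    extern __shared__ int sdata[];
    unsigned tid = threadIdx.x;
    unsigned i   = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? in[i] : 0;   // load, padding out-of-range with 0
    __syncthreads();

    // tree reduction in shared memory
    for (unsigned s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = sdata[0];  // one value per block
}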

Multiply matrix by scalar

I'm a newbie with CUDA and CUBLAS.
I want to multiply each element of a matrix (which I uploaded with cublasSetMatrix) by a scalar value.
Can I use cublasSscal() for that? The documentation says it's for a vector.
Thanks.
Yes, you can use it for a matrix scaling operation as well, assuming your matrix is stored contiguously, i.e. you did an ordinary cudaMalloc with a flat pointer. In that case, even though it's a "matrix", the storage looks the same as a vector's. If you have an MxN matrix, pass M*N as the number of elements of the "vector".
For example, something like (omitting error checking for clarity/brevity):
float *mymatrix, *d_mymatrix;
int size = M * N * sizeof(float);
mymatrix = (float *)malloc(size);          // host matrix, flat storage
cudaMalloc((void **)&d_mymatrix, size);    // device matrix, flat storage
... (cublas/handle setup)
// upload the M*N elements as if they were a vector
cublasSetVector(M*N, sizeof(float), mymatrix, 1, d_mymatrix, 1);
float alpha = 5.0f;
// scale all M*N elements in place
cublasSscal(handle, M*N, &alpha, d_mymatrix, 1);
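To read the scaled matrix back you would mirror the upload with cublasGetVector (a sketch, using the same variables as above):
// mirror of the upload: copy the M*N scaled elements back to the host
cublasGetVector(M*N, sizeof(float), d_mymatrix, 1, mymatrix, 1);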

Copying data to "cufftComplex" data struct?

I have data stored as arrays of floats (single precision). I have one array for my real data, and one array for my imaginary data, which I use as the input to FFTs. I need to copy this data into the cufftComplex data type if I want to use the CUDA cufft library. From nVidia: "cufftComplex is a single-precision, floating-point complex data type that consists of interleaved real and imaginary components." Data to be operated on by cufft is stored in arrays of cufftComplex.
How do I quickly copy my data from a normal C array into an array of cufftComplex? I don't want to use a for loop because it's probably the slowest possible option. I don't know how to use memcpy on arrays of this type, because I do not know how the type is stored in memory. Thanks!
You could do this as part of the host->device copy. Each copy would take one of the contiguous input arrays on the host and copy it in strided fashion to the device. The storage layout of complex data types in CUDA is compatible with the layout defined for complex types in Fortran and C++, i.e. a structure with the real part followed by the imaginary part.
float * real_vec;       // host vector, real part
float * imag_vec;       // host vector, imaginary part
float2 * complex_vec_d; // device vector, single-precision complex
float * tmp_d = (float *) complex_vec_d;
cudaError_t cudaStat;

// interleave: host (stride 1 float) -> device (stride 2 floats)
cudaStat = cudaMemcpy2D (tmp_d, 2 * sizeof(tmp_d[0]),
                         real_vec, 1 * sizeof(real_vec[0]),
                         sizeof(real_vec[0]), n, cudaMemcpyHostToDevice);
cudaStat = cudaMemcpy2D (tmp_d + 1, 2 * sizeof(tmp_d[0]),
                         imag_vec, 1 * sizeof(imag_vec[0]),
                         sizeof(imag_vec[0]), n, cudaMemcpyHostToDevice);
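The same technique works in reverse: to de-interleave a device array back into separate host arrays, swap the pitch arguments and the copy direction (a sketch with the same variables as above):
// de-interleave: device (stride 2 floats) -> host (stride 1 float)
cudaStat = cudaMemcpy2D (real_vec, 1 * sizeof(real_vec[0]),
                         tmp_d, 2 * sizeof(tmp_d[0]),
                         sizeof(real_vec[0]), n, cudaMemcpyDeviceToHost);
cudaStat = cudaMemcpy2D (imag_vec, 1 * sizeof(imag_vec[0]),
                         tmp_d + 1, 2 * sizeof(tmp_d[0]),
                         sizeof(imag_vec[0]), n, cudaMemcpyDeviceToHost);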

How do I calculate variance of gpu_array?

I am trying to compute the variance of a 2D gpu_array. A reduction kernel sounds like a good idea:
http://documen.tician.de/pycuda/array.html
However, this documentation implies that reduction kernels just reduce 2 arrays into 1 array. How do I reduce a single 2D array into a single value?
I guess the first step is to define variance for this case. In MATLAB, the variance function on a 2D array returns a vector (1D array) of values. But it sounds like you want a single-valued variance, so as others have already suggested, probably the first thing to do is to treat the 2D array as 1D. In C we won't require any special steps to accomplish this: if you have a pointer to the array, you can index into it as if it were a 1D array. I'm assuming you don't need help on how to handle a 2D array with a 1D index.
Now if it's the 1D variance you're after, I'm assuming something like variance(x) = sum((x[i]-mean(x))^2)/N, where the sum runs over all i and N is the number of elements (based on my read of the Wikipedia article). We can break this down into 3 steps:
1. Compute the mean (a classical reduction: one value is produced for the whole data set; sum all elements, then divide by the number of elements).
2. Compute (x[i]-mean)^2 for all i; this is an element-by-element operation producing an output data set equal in size (number of elements) to the input data set.
3. Compute the sum of the elements produced in step 2 and divide by N; the sum is another classical reduction, as one value is produced for the entire data set.
Steps 1 and 3 are classical reductions that sum all the elements of an array. Rather than cover that ground here, I'll point you to Mark Harris' excellent treatment of the topic as well as some CUDA sample code. For step 2, I'll bet you could figure out the kernel code on your own, but it would look something like this:
#include <math.h>

// step 2: elementwise squared deviation from the (precomputed) mean
__global__ void var(float *input, float *output, unsigned N, float mean){
    unsigned idx = threadIdx.x + (blockDim.x * blockIdx.x);
    if (idx < N) output[idx] = __powf(input[idx] - mean, 2);
}
Note that you will probably want to combine the reductions and the above code into a single kernel.
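In fact, steps 2 and 3 can be fused into a single pass without a hand-written kernel by using thrust::transform_reduce. A minimal sketch (the functor sq_dev and the variance helper are just illustrative names):
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>

// squared deviation from a fixed mean (step 2 as a functor)
struct sq_dev
{
    float mean;
    sq_dev(float m) : mean(m) {}
    __host__ __device__ float operator()(float x) const
    {
        float d = x - mean;
        return d * d;
    }
};

float variance(const thrust::device_vector<float> &v)
{
    unsigned N = v.size();
    // step 1: mean (classical sum reduction, then divide)
    float mean = thrust::reduce(v.begin(), v.end(), 0.0f) / N;
    // steps 2+3 fused: square each deviation, sum, then divide by N
    float ssd = thrust::transform_reduce(v.begin(), v.end(),
                                         sq_dev(mean), 0.0f,
                                         thrust::plus<float>());
    return ssd / N;
}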

CUDA Thrust: Is it possible to have two device_vectors point to overlapping memory?

If I initialize x using thrust::device_vector<double> x(10), is it possible to create a device_vector y that spans x[2] through x[5]?
Note: I don't want memory to be copied, which happens when I use something like thrust::device_vector<double> y(x.begin(), x.end()).
The thrust device_vector only has allocation or copy constructors, so there isn't a direct way to alias an existing vector or device pointer by constructing another device_vector. But as pointed out in comments, it really isn't needed either. Thrust algorithms always work on iterators, and it is possible to use iterator arithmetic to achieve the same outcome. For example, this creates a new vector via copy construction:
thrust::device_vector<double> x(10);
// copy-constructs y from the half-open range [x.begin()+2, x.begin()+5),
// i.e. elements x[2], x[3], x[4]
thrust::device_vector<double> y(x.begin()+2, x.begin()+5);
double val = thrust::reduce(y.begin(), y.end());
whereas this returns the same answer without the copy:
thrust::device_vector<double> x(10);
// reduce directly over the sub-range; no temporary vector is created
double val = thrust::reduce(x.begin()+2, x.begin()+5);
The result is the same in both cases, the second being equivalent to operating on an alias of a subset of the input vector.
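Relatedly, if what you have is a raw device pointer (e.g. from cudaMalloc) rather than a device_vector, thrust::device_pointer_cast will wrap it so Thrust algorithms can iterate over it, again without copying. A minimal sketch (the fill is just to make the reduce meaningful):
#include <thrust/device_ptr.h>
#include <thrust/fill.h>
#include <thrust/reduce.h>

double *raw;                                  // raw device allocation
cudaMalloc((void **)&raw, 10 * sizeof(double));

// wrap the raw pointer: this aliases the memory, nothing is copied
thrust::device_ptr<double> p = thrust::device_pointer_cast(raw);
thrust::fill(p, p + 10, 1.0);

// operate directly on the aliased sub-range raw[2] .. raw[4]
double val = thrust::reduce(p + 2, p + 5);    // val == 3.0
cudaFree(raw);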