I'm a newbie with cuda and cublas.
I want to multiply each element in a matrix (I used cublasSetMatrix) with a scalar value.
Can I use cublasscal() for that? the documentation says it's for a vector.
Thanks.
Yes, you can use it for a matrix scaling operation as well, assuming your matrix is stored contiguously. That means you did an ordinary cudaMalloc with a flat pointer to store the matrix. In that case even though it's a "matrix" it's stored contiguously in memory, and so the storage looks the same as a vector. If you have a MxN matrix, then pass MxN as the number of elements in the vector.
For example, something like (omitting error checking for clarity/brevity):
float *mymatrix, *d_mymatrix;
int size = M*N*sizeof(float);
mymatrix = (float *)malloc(size);
cudaMalloc((void **)&d_mymatrix, size);
... (cublas/handle setup)
cublasSetVector(M*N, sizeof(float), mymatrix, 1, d_mymatrix, 1);
float alpha = 5.0;
cublasSscal(handle, M*N, &alpha, d_mymatrix, 1);
Related
I could not find an example of application of cuFFT with CUDA in which the transformation of a matrix is realized as 1D transformations of rows and columns.
I have a 2048x2048 array (set as 1D of cuComplex data). With 2D transform - no problem. But now what I need is to do the transform along x, do some work on it, take inverse fft, then do the transform along y, and do another work on it, then take its inverse transform.
How exactly would the sequence of commands look like if I want to use parallel processing? Should I use cuFFTPlanMany? How? Or, perhaps, is there an example somewhere that I was not able to find?
In the cuFFT Library User's guide, on page 3, there is an example on how computing a number BATCH of one-dimensional DFTs of size NX. Using cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH);, then cufftExecC2C will perform a number BATCH 1D FFTs of size NX. To achieve that, you have to arrange your data in a complex array of length BATCH*NX. In your case, for the transform along x, it would be BATCH=2048 and NX=2048. For the transforms along y, you have to transpose the matrix arising from previous calculations.
Your code will look like the following
#define NX 2048
#define NY 2048
int main() {
cufftHandle plan;
cufftComplex *data;
...
cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*NY);
cufftPlan1d(&plan, NX, CUFFT_C2C, NY);
...
cufftExecC2C(plan, data, data, CUFFT_FORWARD);
...
// do some work
...
// make a transposition
...
cufftPlan1d(&plan, NY, CUFFT_C2C, NX);
...
cufftExecC2C(plan, data, data, CUFFT_FORWARD);
...
}
I am just starting on CUBLAS/CUDA programming. I mainly use this of matrix and vector operations. I am quite confusing on the orientation of the vector used in CUBLAS. It seems that there is no difference between row and column vector. So if I using level-2 function to multiply a matrix with a vector, how can I specify the orientation of the vector? Will it be always treated as column vector? If I want to multiply a column vector (nx1) to a row vector (1xm) to make a matrix (nxm), should I always treat them as matrix and use level 3 function for multiplication?
Also, I am using thrust to generate the vector, so if I pass the thrust vector (n elements) to cublasCgemm to form a 1xn or nx1 matrix (i.e. row or column vector). Will that vector be treated as 1xn or nx1 vector if I set the cublasOperation_t as CUBLAS_OP_N?
Thanks.
All data are stored in single pointers i.e. double*. They are stored in memory sequentially. There is no difference between row and column vectors. Single pointers are used for 2D arrays, too. CUBLAS gives you an easy define to locate elements in the matrix
#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))
where i is row, j is column and ld is leading dimension of the matrix. ld is used when you want to use a submatrix of a full matrix in the operation.
The multiplication (nx1)(1xm)=(nxm) is performed by cublasDger function.
cublasStatus_t cublasDger(cublasHandle_t handle, int m, int n,
const double *alpha,
const double *x, int incx,
const double *y, int incy,
double *A, int lda)
If for example y is part of an (kxm) matrix then use incy=k.
Is there a way to convert a 2D vector into an array to be able to use it in CUDA kernels?
It is declared as:
vector<vector<int>> information;
I want to cudaMalloc and copy from host to device, what would be the best way to do it?
int *d_information;
cudaMalloc((void**)&d_information, sizeof(int)*size);
cudaMemcpy(d_information, information, sizeof(int)*size, cudaMemcpyHostToDevice);
In a word, no there isn't. The CUDA API doesn't support deep copying and also doesn't know anything about std::vector either. If you insist on having a vector of vectors as a host source, it will require doing something like this:
int *d_information;
cudaMalloc((void**)&d_information, sizeof(int)*size);
int *dst = d_information;
for (std::vector<std::vector<int> >::iterator it = information.begin() ; it != information.end(); ++it) {
int *src = &((*it)[0]);
size_t sz = it->size();
cudaMemcpy(dst, src, sizeof(int)*sz, cudaMemcpyHostToDevice);
dst += sz;
}
[disclaimer: written in browser, not compiled or tested. Use at own risk]
This would copy the host memory to an allocation in GPU linear memory, requiring one copy for each vector. If the vector of vectors is a "jagged" array, you will want to store an indexing somewhere for the GPU to use as well.
As far as I understand, the vector of vectors do not need to reside in a contiguous memory, i.e. they can be fragmented.
Depending on the amount of memory you need to transfer I would do one of two issues:
Reorder your memory to be a single vector, and then use your cudaMemcpy.
Create a series of cudaMemcpyAsync, where each copy handles a single vector in your vector of vectors, and then synchronize.
How will the cudaMemcpy function work in this case?
I have declared a matrix like this
float imagen[par->N][par->M];
and I want to copy it to the cuda device so I did this
float *imagen_cuda;
int tam_cuda=par->M*par->N*sizeof(float);
cudaMalloc((void**) &imagen_cuda,tam_cuda);
cudaMemcpy(imagen_cuda,imagen,tam_cuda,cudaMemcpyHostToDevice);
Will this copy the 2d array into a 1d array fine?
And how can I copy to another 2d array? can I change this and will it work?
float **imagen_cuda;
It's not trivial to handle a doubly-subscripted C array when copying data between host and device. For the most part, cudaMemcpy (including cudaMemcpy2D) expect an ordinary pointer for source and destination, not a pointer-to-pointer.
The simplest approach (I think) is to "flatten" the 2D arrays, both on host and device, and use index arithmetic to simulate 2D coordinates:
float imagen[par->N][par->M];
float *myimagen = &(imagen[0][0]);
float myval = myimagen[(rowsize*row) + col];
You can then use ordinary cudaMemcpy operations to handle the transfers (using the myimagen pointer):
float *d_myimagen;
cudaMalloc((void **)&d_myimagen, (par->N * par->M)*sizeof(float));
cudaMemcpy(d_myimagen, myimagen, (par->N * par->M)*sizeof(float), cudaMemcpyHostToDevice);
If you really want to handle dynamically sized (i.e. not known at compile time) doubly-subscripted arrays, you can review this question/answer.
I have data stored as arrays of floats (single precision). I have one array for my real data, and one array for my complex data, which I use as the input to FFTs. I need to copy this data into the cufftComplex data type if I want to use the CUDA cufft library. From nVidia: " cufftComplex is a single‐precision, floating‐point complex data type that consists of interleaved real and imaginary components." Data to be operated on by cufft is stored in arrays of cufftComplex.
How do I quickly copy my data from a normal C array into an array of cufftComplex ? I don't want to use a for loop because it's probably the slowest possible option. I don't know how to use memcpy on arrays data of this type, because I do not know how it is stored in memory. Thanks!
You could do this as part of a host-> device copy. Each copy would take one of the contiguous input arrays on the host and copy it in strided fashion to the device. The storage layout of complex data types in CUDA is compatible with the layout defined for complex types in Fortran and C++, i.e. as a structure with the real part followed by imaginary part.
float * real_vec; // host vector, real part
float * imag_vec; // host vector, imaginary part
float2 * complex_vec_d; // device vector, single-precision complex
float * tmp_d = (float *) complex_vec_d;
cudaStat = cudaMemcpy2D (tmp_d, 2 * sizeof(tmp_d[0]),
real_vec, 1 * sizeof(real_vec[0]),
sizeof(real_vec[0]), n, cudaMemcpyHostToDevice);
cudaStat = cudaMemcpy2D (tmp_d + 1, 2 * sizeof(tmp_d[0]),
imag_vec, 1 * sizeof(imag_vec[0]),
sizeof(imag_vec[0]), n, cudaMemcpyHostToDevice);