CUBLAS accumulate output - cuda

This is a very simple question about the CUBLAS library which, strangely, I couldn't find an answer to in the documentation or elsewhere.
I am using a rather old version of CUBLAS (10.2), but that should not matter. I use cublasSgemm to multiply two 32-bit float matrices A * B and put the result in matrix C:
stat = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, nRows, k, nCols, alpha, A, nRows, B, k, beta, C, nRows);
Is it possible to make CUBLAS accumulate the result in C? That is, if C already contains some data, it would not be overwritten but accumulated with the multiplication result.
This can be used, for example, when memory is limited and one needs to split input matrices that are too big and multiply in several passes. I couldn't find such an option in cublasSgemm, however.

Is it possible to make CUBLAS accumulate the result in C? That is, if C already contains some data, it would not be overwritten but accumulated with the multiplication result.
Yes, cublasSgemm does exactly that. Referring to the documentation:
This function performs the matrix-matrix multiplication
C = α op(A) op(B) + β C
The "+ β C" term is the accumulation part of the formula.
If you set beta to zero, then the previous contents of C will not be accumulated.
If you set beta to 1, then the previous contents of C will be added to the multiplication (AxB) result.
If you set beta to some other value, a scaled (multiplied) version of the previous contents of C will be added.
Note that, as far as this description and function are concerned, this behaviour was defined as part of the netlib BLAS specification of GEMM; it should be the same in other BLAS libraries and is not unique or specific to CUBLAS.
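For illustration, here is a minimal sketch of an accumulating call (assuming handle is an initialized cublasHandle_t and d_A, d_B, d_C are device buffers already filled with column-major data; the names and sizes are placeholders, not from the question):
// C = 1.0 * A * B + 1.0 * C  -- beta = 1 accumulates into the existing C
// d_A is m x k, d_B is k x n, d_C is m x n, all column-major.
const float alpha = 1.0f;
const float beta  = 1.0f;          // set beta = 0.0f to overwrite C instead
cublasStatus_t stat = cublasSgemm(handle,
                                  CUBLAS_OP_N, CUBLAS_OP_N,
                                  m, n, k,
                                  &alpha, d_A, m,
                                  d_B, k,
                                  &beta, d_C, m);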

Related

Elementwise vector Multiplication in cublas [duplicate]

I need to compute the element-wise multiplication of two vectors (Hadamard product) of complex numbers with NVIDIA CUBLAS. Unfortunately, there is no HAD operation in CUBLAS. Apparently you can do this with the SBMV operation, but it is not implemented for complex numbers in CUBLAS. I cannot believe there is no way to achieve this with CUBLAS. Is there any other way to achieve it with CUBLAS, for complex numbers?
I cannot write my own kernel; I have to use CUBLAS (or another standard NVIDIA library if it is really not possible with CUBLAS).
CUBLAS is based on the reference BLAS, and the reference BLAS has never contained a Hadamard product (complex or real). Hence CUBLAS doesn't have one either. Intel have added v?Mul to MKL for doing this, but it is non-standard and not in most BLAS implementations. It is the kind of operation that an old school Fortran programmer would just write a loop for, so I presume it really didn't warrant a dedicated routine in BLAS.
There is no "standard" CUDA library I am aware of which implements a Hadamard product. There would be the possibility of using CUBLAS GEMM or SYMM to do this and extracting the diagonal of the resulting matrix, but that would be horribly inefficient, both from a computation and a storage standpoint.
The Thrust template library can do this trivially using thrust::transform, for example:
thrust::multiplies<thrust::complex<float> > op;
thrust::transform(thrust::device, x, x + n, y, z, op);
would iterate over each pair of inputs from the device pointers x and y and calculate z[i] = x[i] * y[i] (there are probably a couple of casts you need to make to compile that, but you get the idea). But that effectively requires compiling CUDA code within your project, and apparently you don't want that.
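For completeness, a self-contained sketch of that approach (the vector names and the wrapper function are placeholders for illustration, not part of the original answer):
#include <thrust/complex.h>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/transform.h>

// Element-wise (Hadamard) product of two complex vectors: z[i] = x[i] * y[i]
void hadamard(const thrust::device_vector<thrust::complex<float> >& x,
              const thrust::device_vector<thrust::complex<float> >& y,
              thrust::device_vector<thrust::complex<float> >& z)
{
    thrust::transform(x.begin(), x.end(),   // first input range
                      y.begin(),            // second input range
                      z.begin(),            // output
                      thrust::multiplies<thrust::complex<float> >());
}
This has to be compiled with nvcc, so it only applies if compiling CUDA code in the project is acceptable.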

How can solve SVD from row-major matrix using cusolver gesvd function

I'm a beginner with CUDA. I want to solve an SVD for a row-major matrix using the cuSOLVER API, but I'm confused about the leading dimension of matrix A.
I have a row-major 100x10 matrix (i.e., 100 data points in a 10-dimensional space).
According to the CUDA documentation, the cusolverDnDgesvd function needs an lda parameter (the leading dimension of matrix A). My matrix is row-major, so I passed 10 to the gesvd function, but it failed and indicated that my lda parameter was wrong.
OK, so I passed 100 instead. The function ran, but the results (U, S, Vt) seem to be wrong: I can't recover the matrix A from U*S*Vt.
As far as I know, the cuSOLVER API assumes all matrices are column-major.
If I treat my matrix as column-major, m is smaller than n (10x100), but gesvd only works for m >= n.
Yes, I'm in trouble. How can I solve this problem?
Row-major, col-major and leading dimension are concepts related to the storage. A matrix can be stored in either scheme, while representing the same mathematical matrix.
To get the correct result, you could use cublasDgeam() to change your row-major 100x10 matrix into a column-major 100x10 matrix, which is equivalent to a matrix transpose while keeping the storage order, before calling cusolver.
There are many sources talking about storage ordering,
https://en.wikipedia.org/wiki/Row-major_order
https://fgiesen.wordpress.com/2012/02/12/row-major-vs-column-major-row-vectors-vs-column-vectors/
https://eigen.tuxfamily.org/dox-devel/group__TopicStorageOrders.html
Confusion between C++ and OpenGL matrix order (row-major vs column-major)
as well as leading dimension
http://www.ibm.com/support/knowledgecenter/SSFHY8_5.3.0/com.ibm.cluster.essl.v5r3.essl100.doc/am5gr_leaddi.htm
You should google them.
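A rough sketch of the cublasDgeam() suggestion (assuming handle is a valid cublasHandle_t and d_A_rm is a device buffer already holding the 100x10 row-major data; the names are placeholders, not from the answer):
const int m = 100, n = 10;           // mathematical size of A
const double one = 1.0, zero = 0.0;
double *d_A_cm;                      // will hold the column-major copy
cudaMalloc(&d_A_cm, sizeof(double) * m * n);
// Viewed as column-major, the row-major buffer is a 10x100 matrix (i.e. A^T),
// so transposing it with geam produces the 100x10 column-major A.
cublasDgeam(handle,
            CUBLAS_OP_T, CUBLAS_OP_N,  // transpose A; B is unused since beta = 0
            m, n,                      // size of the result
            &one,  d_A_rm, n,          // input viewed as 10x100, lda = 10
            &zero, d_A_cm, m,          // B is ignored because beta = 0
            d_A_cm, m);                // output: 100x10 column-major, ldc = 100
// d_A_cm can now be passed to cusolverDnDgesvd with lda = 100.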

cublas: same input and output matrix for better performance?

I see that CUBLAS can be an efficient package for a single large matrix multiplication or addition, etc. But in a common setting, most computations are dependent: the next step relies on the result of the previous step.
This causes a problem, because the output matrix has to be different from the input matrices in a CUBLAS routine (the input matrices are const), so much time is spent on malloc and device-to-device copies for these temporary matrices.
So is it possible to do something like multiply(A, A, B), where the first argument is the output matrix and the second/third are the input matrices, to avoid the extra memory manipulation time? Or is there a better workaround?
Thanks a lot!
No, it is not possible to perform in-place operations like gemm using CUBLAS (in fact, I am not aware of any parallel BLAS implementation which guarantees such an operation will work).
Having said that, this comment:
.... much time is spent on malloc and device-to-device copies for these temporary matrices.
makes me think you might be overlooking the obvious. While it is necessary to allocate space for interim matrices, it certainly isn't necessary to perform device to device memory copies when using such allocations. This:
// If A, B & C are pointers to allocations in device memory
// compute C = A*B and copy result to A
multiply(C, A, B);
cudaMemcpy(A, C, sizeA, cudaMemcpyDeviceToDevice);
// now A = A*B
can be replaced by
multiply(C, A, B);
float * tmp = A; A = C; C = tmp;   // swap the host-side pointers
i.e. you only need to exchange pointers on the host to perform the equivalent of a device-to-device memory copy, at no GPU time cost. This can't be used in every situation (for example, some in-place block operations might still require an explicit memory transfer), but in most cases an explicit device-to-device memory transfer can be avoided.
If the memory cost of large dense operations with CUBLAS is limiting your application, consider investigating "out of core" approaches to working with large dense matrices.
You could pre-allocate a buffer matrix, and copy the input matrix A to the buffer before the mat-mul operation.
Memcopy(buffer, A);
Multiply(A, buffer, B);
By reusing the buffer, you don't need to allocate it every time, and the overhead is only one memory copy per mat-mul. When your matrix is large enough, the overhead takes a very small portion of the time and can be ignored.
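A rough sketch of that buffer-reuse pattern with actual CUDA/CUBLAS calls (the names d_A, d_B, n and nSteps are assumptions for illustration, not from the answer):
// Allocate the scratch buffer once, then reuse it for every mat-mul step.
float *d_buf;
cudaMalloc(&d_buf, sizeof(float) * n * n);
const float alpha = 1.0f, beta = 0.0f;
for (int step = 0; step < nSteps; ++step) {
    // copy the current A into the scratch buffer (device to device)
    cudaMemcpy(d_buf, d_A, sizeof(float) * n * n, cudaMemcpyDeviceToDevice);
    // overwrite A with buf * B (all matrices n x n, column-major)
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n,
                &alpha, d_buf, n,
                d_B, n,
                &beta, d_A, n);
}
cudaFree(d_buf);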

cuSPARSE dense times sparse

I need to calculate the following matrix math:
D * A
Where D is dense, and A is sparse, in CSC format.
cuSPARSE allows multiplying sparse * dense, where the sparse matrix is in CSR format.
Following a related question, I can "convert" CSC to CSR simply by transposing A.
Also I can calculate (A^T * D^T)^T, as I can handle getting the result transposed.
In this method I can also avoid "transposing" A, because CSR^T is CSC.
The only problem is that cuSPARSE doesn't support transposing D in this operation, so I would have to transpose it beforehand, or convert it to CSR, which is a total waste, as it is very dense.
Is there any workaround? Thanks.
I found a workaround.
I changed the memory accesses to D in my entire code.
If D is an mxn matrix and I used to access it by D[j * m + i], now I access it by D[i * n + j], meaning I made it row-major instead of column-major.
cuSPARSE expects matrices in column-major format, and because a row-major matrix reinterpreted as column-major is its transpose, I can pass D to cuSPARSE functions as a fake transpose without actually transposing it.
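As a small illustration of that indexing change (the helper names are placeholders; D is a flat array of m*n elements):
// Element (i, j) of an m x n matrix D under the two storage schemes.
inline double at_col_major(const double *D, int m, int i, int j) {
    return D[j * m + i];   // column-major: each column is contiguous
}
inline double at_row_major(const double *D, int n, int i, int j) {
    return D[i * n + j];   // row-major: each row is contiguous
}
// A row-major m x n buffer handed to cuSPARSE (which assumes column-major)
// is read as the n x m matrix D^T -- the "fake transpose" described above.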