cuSPARSE dense times sparse - cuda

I need to calculate the following matrix math:
D * A
Where D is dense, and A is sparse, in CSC format.
cuSPARSE allows multiplying sparse * dense, where sparse matrix is in CSR format.
Following a related question, I can "convert" CSC to CSR simply by transposing A.
Also I can calculate (A^T * D^T)^T, as I can handle getting the result transposed.
In this method I can also avoid "transposing" A, because CSR^T is CSC.
The only problem is that cuSPARSE doesn't support transposing D in this operation, so I have to transpose it beforehand, or convert it to CSR, which is a total waste, as it is very dense.
Is there any workaround? Thanks.

I found a workaround.
I changed the memory accesses to D in my entire code.
If D is an mxn matrix, and I used to access it by D[j * m + i], now I'm accessing it by D[i * n + j], meaning I made it row-major instead of column-major.
cuSPARSE expects matrices in column-major format, and because a row-major matrix transposed is column-major, I can pass D to cuSPARSE functions as a "fake" transpose without actually performing the transpose.

Related

CUBLAS accumulate output

This is a very simple question about the Cublas library, to which I strangely couldn't find an answer in the documentation or elsewhere.
I am using a rather old version of CUBLAS (10.2), but that should not matter. I use cublasSgemm to multiply two 32-bit float matrices, A * B, and put the result in matrix C:
stat = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, nRows, k, nCols, alpha, A, nRows, B, k, beta, C, nRows);
Is it possible to make CUBLAS accumulate the result in C? That is, if C already contains some data, can the multiplication result be added to it rather than overwrite it?
This can be useful, for example, when memory is limited and one needs to shrink the input matrices because they are too big, multiplying in several passes. However, I couldn't see such an option in cublasSgemm.
Is it possible to make CUBLAS accumulate the result in C? That is, if C already contains some data, can the multiplication result be added to it rather than overwrite it?
Yes, cublasSgemm does exactly that. Referring to the documentation:
This function performs the matrix-matrix multiplication
C = α·op(A)·op(B) + β·C
The β·C term is the accumulation part of the formula.
If you set beta to zero, then the previous contents of C will not be accumulated.
If you set beta to 1, then the previous contents of C will be added to the multiplication (AxB) result.
If you set beta to some other value, a scaled (multiplied) version of the previous contents of C will be added.
Note that, as far as this description and function are concerned, all of this functionality was defined/specified as part of the netlib BLAS description; it should behave similarly in other BLAS libraries and is not unique or specific to CUBLAS.

Octave's default LU factorization function error

New to this forum.
I'm trying to run Octave's LU decomposition function with complete pivoting, as such:
[L, U, p, q] = lu(A)
for a matrix A I have and I keep getting this error:
"element number 4 undefined in return list"
Element 4 is the matrix of column permutations, Q. What's going on? Why doesn't it show up? Thanks in advance.
If the matrix A is full, the lu function does not perform column exchanges in Octave (emphasis mine):
When called with two or three output arguments and a spare[sic] input matrix, lu does not attempt to perform sparsity preserving column permutations. Called with a fourth output argument, the sparsity preserving column transformation Q is returned, such that P * A * Q = L * U.
So full pivoting is only performed for sparse matrices, to maximize sparsity, and only when the fourth output argument is requested. The quote above uses "A", but per the function signature provided at the top of the linked Octave documentation section, I believe they meant to write "S": "[L, U, P, Q] = lu (S)".
There does not appear to be a full-pivoting option for full matrices by default.
I'll note that MATLAB has the same behavior for the fourth output of its lu:
Column permutation ... . Use this output to reduce the fill-in (number of nonzeros) in the factors of a sparse matrix.

How to transpose a sparse matrix in cuSparse?

I am trying to compute A^TA using cuSparse. A is a large but sparse matrix. The proper function to use based on the documentation is cusparseDcsrgemm2. However, this is one of the few cuSparse operations that doesn't support an optional built-in transpose for the input matrix. There's a line in the documentation that says:
Only the NN version is supported. For other modes, the user has to
transpose A or B explicitly.
The problem is I couldn't find a function in cuSparse that can perform a transpose. I know I can transpose in CPU and copy it to the GPU but that will slow down the application. Am I missing something? What is the right way to use cuSparse to compute A^TA?
For matrices that are in CSR (or CSC) format:
The CSR sparse representation of a matrix has identical format/memory layout as the CSC sparse representation of its transpose.
Therefore, if we use the cusparse-provided function to convert a CSR-format matrix into CSC format, the resultant CSC-format matrix is actually the CSR representation of the transpose of the original matrix. This CSR-to-CSC conversion routine can thus be used to find the transpose of a CSR-format sparse matrix. (It can similarly be used to find the transpose of a CSC-format sparse matrix.)

How can solve SVD from row-major matrix using cusolver gesvd function

I'm a beginner with CUDA. I want to solve the SVD of a row-major matrix using the cuSOLVER API, but I'm confused about the leading dimension of matrix A.
I have a row-major 100x10 matrix (e.g., 100 data points in a 10-dimensional space).
Per the CUDA documentation, the cusolverDnDgesvd function needs an lda parameter (the leading dimension of matrix A). My matrix is row-major, so I passed 10 to the gesvd function. But the function failed, indicating that my lda parameter was wrong.
OK, so I passed 100 instead. The function ran, but the results (U, S, Vt) seem to be wrong: I can't reconstruct the matrix A from U·S·Vt.
As far as I know, the cuSOLVER API assumes all matrices are column-major.
If I reinterpret my matrix as column-major, m is smaller than n (10x100), but gesvd only works for m >= n.
Yes, I'm in trouble. How can I solve this problem?
Row-major, col-major and leading dimension are concepts related to the storage. A matrix can be stored in either scheme, while representing the same mathematical matrix.
To get a correct result, you could use cublasDgeam() to change your row-major 100x10 matrix into a column-major 100x10 matrix, which is equivalent to a matrix transpose while keeping the storage order, before calling cusolver.
There are many sources talking about storage ordering,
https://en.wikipedia.org/wiki/Row-major_order
https://fgiesen.wordpress.com/2012/02/12/row-major-vs-column-major-row-vectors-vs-column-vectors/
https://eigen.tuxfamily.org/dox-devel/group__TopicStorageOrders.html
Confusion between C++ and OpenGL matrix order (row-major vs column-major)
as well as leading dimension
http://www.ibm.com/support/knowledgecenter/SSFHY8_5.3.0/com.ibm.cluster.essl.v5r3.essl100.doc/am5gr_leaddi.htm
You should google them.

cuSPARSE csrmm with dense matrix in row-major format

I want to use the cuSPARSE csrmm function to multiply two matrices. The A matrix is sparse and the B matrix is dense. The dense matrix is in row-major format. Is there some nice way (trick) to accomplish this without the need to explicitly transpose B? I am thinking of something similar to this for dense BLAS.
Thanks