Is there a most efficient way to multiply three matrices A * B * C = D using cuBLAS? - cuda

I want to find the most efficient way to multiply three matrices using cuBLAS. My current solution makes the obvious two calls to cublas<t>gemm:
cublas<t>gemm(cublasH, transa, transb, m, n, k, &alpha, d_A, lda, d_B, ldb, &beta, d_AB, ldab)
cublas<t>gemm(cublasH, CUBLAS_OP_N, transc, m, n, k, &alpha, d_AB, ldab, d_C, ldc, &beta, d_D, ldd)
I don't think this is a bad solution, only that it would be better if there were some way to do it with a single kernel/function call rather than two, as a single kernel would presumably be a bit faster.
I've looked at cublas<t>gemmBatched hoping some manipulation would make it fit, but the documentation states that the multiplications must be independent of each other, so that seems off the table.
Is there some way to use cuBLAS or some other mathematical shortcut worth trying to achieve this optimization?

CUBLAS doesn't have any direct support for this, i.e. a single function call that accepts three matrices to be multiplied together.
The way to do it in CUBLAS is the way you have already indicated.
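On the "mathematical shortcut" side, one thing that may be worth checking is the order of the two products: matrix multiplication is associative, so (A*B)*C and A*(B*C) give the same D but can differ a lot in cost. The following is a minimal sketch, assuming A is m x k, B is k x n and C is n x p. The names ChainCost, chain_cost and prefer_ab_first are hypothetical helpers, not part of cuBLAS.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not a cuBLAS API: multiply-add counts for the two
// ways of evaluating D = A * B * C with A (m x k), B (k x n), C (n x p).
struct ChainCost {
    std::uint64_t ab_first;  // cost of (A*B)*C
    std::uint64_t bc_first;  // cost of A*(B*C)
};

inline ChainCost chain_cost(std::uint64_t m, std::uint64_t k,
                            std::uint64_t n, std::uint64_t p) {
    assert(m > 0 && k > 0 && n > 0 && p > 0);
    ChainCost c;
    c.ab_first = m * k * n + m * n * p;  // A*B is m x n, then multiply by C
    c.bc_first = k * n * p + m * k * p;  // B*C is k x p, then multiply by A
    return c;
}

// True when forming A*B first needs no more multiply-adds than B*C first.
inline bool prefer_ab_first(std::uint64_t m, std::uint64_t k,
                            std::uint64_t n, std::uint64_t p) {
    ChainCost c = chain_cost(m, k, n, p);
    return c.ab_first <= c.bc_first;
}
```

Either order still takes two gemm calls; the count only decides which intermediate product to form first.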

Related

CUBLAS accumulate output

This is a very simple question about the CUBLAS library, for which I strangely couldn't find an answer in the documentation or elsewhere.
I am using a rather old version of CUBLAS (10.2), but that should not matter. I use cublasSgemm to multiply two matrices of 32-bit floats, A * B, and put the result in matrix C:
stat = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, nRows, k, nCols, alpha, A, nRows, B, k, beta, C, nRows);
Is it possible to make CUBLAS accumulate the result in C? That is, if C already contains some data, it would not be erased but added to the multiplication result.
This is useful, for example, when memory is limited and the input matrices are too big, so one needs to split them into smaller pieces and multiply several times. However, I couldn't see such an option in cublasSgemm.
Yes, cublasSgemm does exactly that. Referring to the documentation:
This function performs the matrix-matrix multiplication
C = α op(A) op(B) + β C
The β C term is the accumulation part of the formula.
If you set beta to zero, then the previous contents of C will not be accumulated.
If you set beta to 1, then the previous contents of C will be added to the multiplication (AxB) result.
If you set beta to some other value, a scaled (multiplied) version of the previous contents of C will be added.
Note that as far as this description and function are concerned, all of this functionality was defined/specified as part of the netlib BLAS description, and should be similar to other BLAS libraries, and is not unique or specific to CUBLAS.
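The memory-limited use case from the question can be sketched on the CPU. This is an illustration under assumptions, not cuBLAS code: gemm_ref is a plain reference loop for the formula above (column-major, no transposes), and chunked_product is a hypothetical driver that splits the inner dimension into two pieces, using beta = 0 for the first pass and beta = 1 for the second.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference of the GEMM update C = alpha * A * B + beta * C for
// column-major, non-transposed operands. Illustration only, not cuBLAS.
inline void gemm_ref(int m, int n, int k, float alpha,
                     const float* A, int lda, const float* B, int ldb,
                     float beta, float* C, int ldc) {
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int l = 0; l < k; ++l) {
                acc += A[i + (std::size_t)l * lda] * B[l + (std::size_t)j * ldb];
            }
            float& c = C[i + (std::size_t)j * ldc];
            // beta == 0 overwrites C; any other beta keeps a scaled copy of it.
            c = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * c;
        }
    }
}

// Hypothetical driver: form C = A * B in two passes over the inner dimension,
// as one might when only part of A and B fits in memory at a time.
inline std::vector<float> chunked_product(int m, int n, int k, int k_split,
                                          const std::vector<float>& A,
                                          const std::vector<float>& B) {
    assert(0 < k_split && k_split < k);
    std::vector<float> C((std::size_t)m * n);
    // First chunk: columns [0, k_split) of A times rows [0, k_split) of B.
    gemm_ref(m, n, k_split, 1.0f, A.data(), m, B.data(), k, 0.0f, C.data(), m);
    // Second chunk: beta = 1 adds onto the partial result already in C.
    gemm_ref(m, n, k - k_split, 1.0f,
             A.data() + (std::size_t)k_split * m, m,
             B.data() + k_split, k, 1.0f, C.data(), m);
    return C;
}
```

With cublasSgemm the same pattern would apply: keep the pointer to C fixed across calls and pass a beta of 1 for every chunk after the first.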

cuFFT of a matrix as a 1D transformation of rows or columns

I could not find an example of application of cuFFT with CUDA in which the transformation of a matrix is realized as 1D transformations of rows and columns.
I have a 2048x2048 array (set as 1D of cuComplex data). With 2D transform - no problem. But now what I need is to do the transform along x, do some work on it, take inverse fft, then do the transform along y, and do another work on it, then take its inverse transform.
What exactly would the sequence of commands look like if I want to use parallel processing? Should I use cufftPlanMany? How? Or, perhaps, is there an example somewhere that I was not able to find?
The cuFFT Library User's Guide (page 3) has an example of computing a number BATCH of one-dimensional DFTs of size NX. After cufftPlan1d(&plan, NX, CUFFT_C2C, BATCH);, a call to cufftExecC2C performs BATCH 1D FFTs of size NX. For this, the data must be arranged in a complex array of length BATCH*NX. In your case, the transform along x has BATCH=2048 and NX=2048. For the transforms along y, you have to transpose the matrix resulting from the previous calculations.
Your code will look like the following
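#include <cufft.h>
#define NX 2048
#define NY 2048
int main() {
    cufftHandle plan;
    cufftComplex *data;
    ...
    cudaMalloc((void**)&data, sizeof(cufftComplex)*NX*NY);
    // NY batched 1D transforms of length NX (transform along x)
    cufftPlan1d(&plan, NX, CUFFT_C2C, NY);
    ...
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);
    ...
    // do some work
    ...
    // make a transposition
    ...
    // release the first plan before creating the next one
    cufftDestroy(plan);
    // NX batched 1D transforms of length NY (transform along y)
    cufftPlan1d(&plan, NY, CUFFT_C2C, NX);
    ...
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);
    ...
}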
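To see why "transform the rows, transpose, transform the rows again" covers both directions, the sequence can be modelled on the CPU with a naive DFT. This is only a sketch under assumptions: dft_rows stands in for one batched 1D plan and transpose for the transposition step; both are hypothetical helpers, not cuFFT functions, and the naive O(n^2) DFT is meant for tiny inputs.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Naive forward DFT of every row of a row-major rows x cols array. This is a
// CPU stand-in for one batched 1D plan (transform size = cols, batch = rows).
inline void dft_rows(std::vector<cplx>& a, int rows, int cols) {
    assert(a.size() == (std::size_t)rows * cols);
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<cplx> out(cols);
    for (int r = 0; r < rows; ++r) {
        for (int f = 0; f < cols; ++f) {
            cplx sum(0.0, 0.0);
            for (int x = 0; x < cols; ++x) {
                double angle = -two_pi * f * x / cols;
                sum += a[(std::size_t)r * cols + x] *
                       cplx(std::cos(angle), std::sin(angle));
            }
            out[f] = sum;
        }
        for (int f = 0; f < cols; ++f) a[(std::size_t)r * cols + f] = out[f];
    }
}

// Out-of-place transpose: rows x cols in, cols x rows out (both row-major).
inline std::vector<cplx> transpose(const std::vector<cplx>& a, int rows, int cols) {
    assert(a.size() == (std::size_t)rows * cols);
    std::vector<cplx> t(a.size());
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            t[(std::size_t)c * rows + r] = a[(std::size_t)r * cols + c];
    return t;
}
```

Calling dft_rows, then transpose, then dft_rows again yields the 2D DFT of the input in transposed layout; in the question's workflow the per-direction work and inverse transforms would sit between those calls.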

How threads/blocks are mapped on GPU while calling cublasSgemm/clAmdBlasSgemm routines?

I am interested in knowing how cublasSgemm/clAmdBlasSgemm routines are mapped on GPU while calculating matrix multiplication (C = A * B).
Assume the dimensions of the input matrices are A_rows = 6144, A_cols = 12288, B_rows = 12288, B_cols = 15360,
and the dimensions of the resultant matrix are C_rows = 6144, C_cols = 15360.
Assume I have initialized the input matrices on the host and copied the matrix data into device memory. After that I call the following cuBLAS or clAmdBlas routine to do the matrix multiplication on the GPU:
void cublasSgemm (char transa, char transb, int m, int n, int k, float alpha, const float *A, int lda, const float *B, int ldb, float beta, float *C, int ldc);
where m = A_rows and n = B_cols.
So my questions are:
1. How are these routines implemented on the GPU?
2. Are the m and n values mapped onto one compute unit (SM)? If not, what is the maximum value for m and n?
3. Do we have control over the threads/blocks?
For the host-side CUBLAS API (note that I have no idea why you would assume that clAmdBlasSgemm would be the same), the short answers to your questions are as follows:
1. Modern CUBLAS is closed source. There are code bases like Magma which you could look at to at least get a feel for how CUBLAS might be implemented. You can also run CUBLAS code in one of the NVIDIA-supplied profilers to see what it does on the GPU. But the point is that you don't need to know how it works. There is an API and some very thorough documentation. That is all you need to know.
2. Your example problem requires roughly 1.4 GB of memory for the three matrices. If you have a GPU with that much memory, and either enough computational capacity to avoid the display driver watchdog timer or a compute-dedicated GPU, it will work. Memory and the display driver time limit (where applicable) are the only limitations.
3. No.
Note that there is also a CUBLAS device API for K20 Kepler devices, and the answers I provided above do not apply to that library.
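The memory requirement of the example can be checked with a few lines of arithmetic. A minimal sketch, assuming single-precision storage and counting only the three operands (no workspace or other allocations); sgemm_operand_bytes is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for the three single-precision operands of C = A * B, with
// A (m x k), B (k x n) and C (m x n). Hypothetical helper for estimation only.
inline std::uint64_t sgemm_operand_bytes(std::uint64_t m, std::uint64_t n,
                                         std::uint64_t k) {
    assert(m > 0 && n > 0 && k > 0);
    return (m * k + k * n + m * n) * sizeof(float);
}
```

For m = 6144, n = 15360, k = 12288 this comes to 1,434,451,968 bytes, which can be compared against the free memory reported by cudaMemGetInfo before allocating.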
Before going any further, you should read the papers of Volkov and Demmel; have a look at http://www.cs.berkeley.edu/~volkov/ and see the article regarding SGEMM. The answers have been there since 2008.

Multiply matrix by scalar

I'm a newbie with CUDA and CUBLAS.
I want to multiply each element of a matrix (which I uploaded with cublasSetMatrix) by a scalar value.
Can I use cublas<t>scal() for that? The documentation says it's for a vector.
Thanks.
Yes, you can use it for a matrix scaling operation as well, assuming your matrix is stored contiguously. That means you did an ordinary cudaMalloc with a flat pointer to store the matrix. In that case, even though it's a "matrix", it's stored contiguously in memory, so the storage looks the same as a vector. If you have an MxN matrix, pass M*N as the number of elements in the vector.
For example, something like (omitting error checking for clarity/brevity):
float *mymatrix, *d_mymatrix;
size_t size = M*N*sizeof(float);
mymatrix = (float *)malloc(size);
cudaMalloc((void **)&d_mymatrix, size);
// ... (cublas/handle setup)
// the matrix is contiguous, so copy it as a vector of M*N elements
cublasSetVector(M*N, sizeof(float), mymatrix, 1, d_mymatrix, 1);
float alpha = 5.0f;
// scale all M*N elements in place
cublasSscal(handle, M*N, &alpha, d_mymatrix, 1);
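The contiguity assumption matters when the matrix has a padded leading dimension. The following CPU sketch shows the difference; scal_ref models a scal-style update and scale_matrix is a hypothetical wrapper (neither is a CUBLAS function), assuming column-major storage with lda elements between consecutive columns.

```cpp
#include <cassert>
#include <cstddef>

// CPU model of a scal-style update: x[i * incx] *= alpha for n elements.
inline void scal_ref(int n, float alpha, float* x, int incx) {
    for (int i = 0; i < n; ++i) x[(std::size_t)i * incx] *= alpha;
}

// Scale an M x N column-major matrix whose columns are lda elements apart.
// With lda == M the storage is contiguous and one call over M*N elements
// suffices; otherwise scale column by column so the padding is left untouched.
inline void scale_matrix(int M, int N, int lda, float alpha, float* a) {
    assert(lda >= M);
    if (lda == M) {
        scal_ref(M * N, alpha, a, 1);
    } else {
        for (int j = 0; j < N; ++j) {
            scal_ref(M, alpha, a + (std::size_t)j * lda, 1);
        }
    }
}
```

On the device, the per-column branch would correspond to one scal call per column, each starting at the column's offset.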

Non Square Matrix Multiplication in CUDA

The code I use for matrix multiplications in CUDA lets me multiply both square and non square matrices, however, both Width and Height MUST be multiples of blocksize.
So, for example, I can multiply [3][6] * [6][3] (using blocksize=3), but I can't multiply [3][2]*[2][3].
Does anyone know a way to do that? This is my kernel:
#include <stdio.h>
#include <limits.h>
#include <stdlib.h>
#define blocksize 3
#define HM (1*blocksize)
#define WM (2*blocksize)
#define WN (1*blocksize)
#define HN WM
#define WP WN
#define HP HM
#define PTH WM
#define PTW HM
__global__ void nonsquare(float*M, float*N, float*P, int uWM,int uWN)
{
    __shared__ float MS[blocksize][blocksize];
    __shared__ float NS[blocksize][blocksize];
    int tx=threadIdx.x, ty=threadIdx.y, bx=blockIdx.x, by=blockIdx.y;
    int rowM=ty+by*blocksize;
    int colN=tx+bx*blocksize;
    float Pvalue=0;
    for(int m=0; m< uWM/blocksize;++m){
        MS[ty][tx]=M[rowM*uWM+(m*blocksize+tx)];
        NS[ty][tx]=M[colN + uWN*(m*blocksize+ty)];
        __syncthreads();
        for(int k=0;k<blocksize;k++)
            Pvalue+=MS[ty][k]*NS[k][tx];
        __syncthreads();
        P[rowM*WP+colN]=Pvalue;
    }
}
Thanks in advance!
I think the easiest thing to do would be to just pad the blocks on the end with zeros:
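for(int m=0; m< (uWM+blocksize-1)/blocksize;++m){  // round the tile count up
    int colM = m*blocksize+tx;  // column of M loaded by this thread
    int rowN = m*blocksize+ty;  // row of N loaded by this thread
    // pad each tile independently; M has HM rows and N has uWM rows (HN == WM)
    if (rowM < HM && colM < uWM)
        MS[ty][tx]=M[rowM*uWM+colM];
    else
        MS[ty][tx]=0.;
    if (rowN < uWM && colN < uWN)
        NS[ty][tx]=N[colN + uWN*rowN];
    else
        NS[ty][tx]=0.;
plus or minus; the final store to P needs the same kind of bounds check (rowM < HM && colN < uWN). (That NS line should reference N, not M, right?)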
But, since I seem to be the only one here advocating using existing tuned libraries when possible -- why not use CUBLAS or MAGMA instead of rolling your own? They're fast, and tested by hundreds of users.
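The zero-padding idea can be walked through on the CPU before touching the kernel. This is a sketch under assumptions, not the poster's kernel: tiled_matmul is a hypothetical function that mimics the load phase and compute phase of one thread block at a time, for row-major matrices of any size and any tile width.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU walk-through of the tiled scheme with zero padding: P = M * N with
// M (hM x wM) and N (wM x wN), all row-major, for any sizes and tile width.
inline std::vector<float> tiled_matmul(const std::vector<float>& M,
                                       const std::vector<float>& N,
                                       int hM, int wM, int wN, int tile) {
    assert(tile > 0);
    assert(M.size() == (std::size_t)hM * wM && N.size() == (std::size_t)wM * wN);
    std::vector<float> P((std::size_t)hM * wN, 0.0f);
    std::vector<float> MS((std::size_t)tile * tile), NS((std::size_t)tile * tile);
    const int tilesK = (wM + tile - 1) / tile;  // round up, not down
    for (int by = 0; by * tile < hM; ++by) {
        for (int bx = 0; bx * tile < wN; ++bx) {
            for (int m = 0; m < tilesK; ++m) {
                // Load phase: out-of-range elements become zeros.
                for (int ty = 0; ty < tile; ++ty) {
                    for (int tx = 0; tx < tile; ++tx) {
                        int rowM = by * tile + ty, colM = m * tile + tx;
                        int rowN = m * tile + ty, colN = bx * tile + tx;
                        MS[ty * tile + tx] = (rowM < hM && colM < wM)
                            ? M[(std::size_t)rowM * wM + colM] : 0.0f;
                        NS[ty * tile + tx] = (rowN < wM && colN < wN)
                            ? N[(std::size_t)rowN * wN + colN] : 0.0f;
                    }
                }
                // Compute phase: only in-range outputs are accumulated and stored.
                for (int ty = 0; ty < tile; ++ty) {
                    for (int tx = 0; tx < tile; ++tx) {
                        int row = by * tile + ty, col = bx * tile + tx;
                        if (row >= hM || col >= wN) continue;
                        float acc = 0.0f;
                        for (int k = 0; k < tile; ++k)
                            acc += MS[ty * tile + k] * NS[k * tile + tx];
                        P[(std::size_t)row * wN + col] += acc;
                    }
                }
            }
        }
    }
    return P;
}
```

The padded zeros contribute nothing to the dot products, which is why the [3][2]*[2][3] case from the question works with a tile width of 3.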
The underlying performance requirement here is that either the first or second dimension of the shared memory "tile" be a round multiple of 16; historically that is what is necessary to achieve optimal global memory bandwidth (i.e. half-warp coalesced transactions). Whether it should be the first or second dimension of the tile is dictated by whether the matrices are stored in column or row major order. There is nothing to say that the shared memory tile need be square, only that the leading dimension of the storage (LDA in BLAS notation) be a round multiple of 16.
You could easily template the kernel with the tile dimensions as template arguments and instantiate several versions, depending on matrix dimensions. For a given architecture, there is probably an optimal tile dimension which balances occupancy and instruction level parallelism. The "clever" way to solve this is probably to decompose the matrix multiplication into two operations - the first doing the bulk of the work at the optimal tile size, and the second at a different size for the remaining columns. If the result is going straight back to host memory after the product is completed, the second operation might best be done on the host using an optimised BLAS, overlapped with the GPU kernel. This is the approach that many of the routines in the UTK Magma library use.
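The bookkeeping behind that decomposition is small. A minimal sketch with hypothetical helper names (ColumnSplit, split_columns, round_up; none of them come from CUBLAS or Magma): one function splits a column count into a bulk part that is a whole number of tiles plus a remainder, the other rounds a dimension up to the next multiple, as one might for a padded leading dimension.

```cpp
#include <cassert>

// Hypothetical helpers illustrating the decomposition; not from any library.
struct ColumnSplit {
    int bulk;       // leading columns, a whole number of tiles
    int remainder;  // trailing columns handled by a second operation
};

inline ColumnSplit split_columns(int n, int tile) {
    assert(n >= 0 && tile > 0);
    ColumnSplit s;
    s.bulk = (n / tile) * tile;
    s.remainder = n - s.bulk;
    return s;
}

// Smallest multiple of `multiple` that is >= n, e.g. for a padded leading dimension.
inline int round_up(int n, int multiple) {
    assert(n >= 0 && multiple > 0);
    return ((n + multiple - 1) / multiple) * multiple;
}
```

The bulk columns would go to the kernel instantiated at the preferred tile size, and the remainder columns to a second kernel or to a host BLAS call.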