Question on tensor core GEMM implementation? - cuda

I am reading some tensor core material and related code for a simple GEMM. I have two questions:
1. When using tensor cores for D = A*B + C, the hardware multiplies two 4x4 fp16 matrices and adds the fp32 product matrix to an fp32 accumulator. Why does multiplying two fp16 inputs, A*B, produce an fp32 result?
2. In the code example, why are the scale factors alpha and beta needed? In the example, they are both set to 2.0f.
Code snippet from the NVIDIA blog:
for(int i=0; i < c_frag.num_elements; i++) {
c_frag.x[i] = alpha * acc_frag.x[i] + beta * c_frag.x[i];
}

The Tensor Core designers chose to provide an FP32 accumulate option so that the results of many multiply-accumulate steps can be represented with both greater precision (more mantissa bits) and greater range (more exponent bits). This was considered valuable for the computational problems they wanted to support, including HPC and AI calculations. The product of two FP16 numbers is often not exactly representable in FP16 (it may need up to 22 significand bits, and it can overflow or underflow the FP16 range), whereas the product of any two finite FP16 numbers is exactly representable in FP32.
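To make the precision argument concrete, here is a minimal host-side sketch. It does not use real FP16 hardware: `round_to_bits` and `accumulate_ones` are hypothetical helpers that imitate a narrow format by rounding to a given number of significand bits (11 for FP16, 24 for FP32), and the exponent range is not modeled at all.

```cpp
#include <cassert>
#include <cmath>

// Round x to `bits` significand bits, using the current rounding mode
// (round-to-nearest-even by default). Exponent range is NOT modeled, so this
// is only a stand-in for a narrow format such as FP16, not real hardware.
float round_to_bits(float x, int bits) {
    assert(bits > 0 && bits <= 24);
    int exp = 0;
    float mant = std::frexp(x, &exp);       // x = mant * 2^exp, 0.5 <= |mant| < 1
    float scaled = std::ldexp(mant, bits);  // keep `bits` bits before the point
    return std::ldexp(std::nearbyint(scaled), exp - bits);
}

// Add 1.0 `n` times, rounding the running sum to `bits` significand bits
// after every addition, the way a narrow accumulator would.
float accumulate_ones(int n, int bits) {
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        sum = round_to_bits(sum + 1.0f, bits);
    }
    return sum;
}
```

With 11 significand bits, `accumulate_ones(4096, 11)` stalls at 2048, because 2049 is not representable and rounds back to 2048. With 24 bits, `accumulate_ones(4096, 24)` reaches 4096. A long dot product in a GEMM runs into the same effect, which is the case for the wider accumulator.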
The scale factors alpha and beta are provided so that the provided GEMM operation could easily correspond to the well-known BLAS GEMM operation, which is widely used in numerical computation. This allows developers to more easily use the Tensorcore capability to provide a commonly used calculation paradigm in existing numerical computation codes. It is the same reason that the CUBLAS GEMM implementation provides these adjustable parameters.
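For reference, the BLAS-style operation that the scaling loop in the question corresponds to can be sketched on the host as follows. This is an illustrative reference loop, not the blog's kernel or the cuBLAS implementation; `gemm_ref` is a hypothetical name, and the matrices are assumed to be column-major with no padding.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference GEMM: C = alpha * A * B + beta * C.
// A is m x k, B is k x n, C is m x n, all column-major with no padding.
void gemm_ref(std::size_t m, std::size_t n, std::size_t k,
              float alpha, const std::vector<float>& A,
              const std::vector<float>& B,
              float beta, std::vector<float>& C) {
    assert(A.size() == m * k && B.size() == k * n && C.size() == m * n);
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < m; ++i) {
            float acc = 0.0f;  // plays the role of acc_frag: the plain product A*B
            for (std::size_t p = 0; p < k; ++p) {
                acc += A[i + p * m] * B[p + j * k];
            }
            // Same form as the loop in the question.
            C[i + j * m] = alpha * acc + beta * C[i + j * m];
        }
    }
}
```

Setting alpha = 1 and beta = 0 gives the plain product C = A*B, and alpha = 1, beta = 1 gives the accumulate form C = A*B + C, so one routine covers the common variants.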

Related

Elementwise vector Multiplication in cublas [duplicate]

I need to compute the element-wise multiplication (Hadamard product) of two vectors of complex numbers with NVIDIA CUBLAS. Unfortunately, there is no HAD operation in CUBLAS. Apparently, you can do this with the SBMV operation, but it is not implemented for complex numbers in CUBLAS. I cannot believe there is no way to achieve this with CUBLAS. Is there any other way to do it with CUBLAS for complex numbers?
I cannot write my own kernel, I have to use CUBLAS (or another standard NVIDIA library if it is really not possible with CUBLAS).
CUBLAS is based on the reference BLAS, and the reference BLAS has never contained a Hadamard product (complex or real). Hence CUBLAS doesn't have one either. Intel have added v?Mul to MKL for doing this, but it is non-standard and not in most BLAS implementations. It is the kind of operation that an old-school Fortran programmer would just write a loop for, so I presume it really didn't warrant a dedicated routine in BLAS.
There is no "standard" CUDA library I am aware of which implements a Hadamard product. It would be possible to use CUBLAS GEMM or SYMM to do this and extract the diagonal of the resulting matrix, but that would be horribly inefficient, from both a computation and a storage standpoint.
The Thrust template library can do this trivially using thrust::transform, for example:
thrust::multiplies<thrust::complex<float> > op;
thrust::transform(thrust::device, x, x + n, y, z, op);
would iterate over each pair of inputs from the device pointers x and y and calculate z[i] = x[i] * y[i] (there are probably a couple of casts you need to make that compile, but you get the idea). But that effectively requires compiling CUDA code within your project, and apparently you don't want that.
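For comparison, thrust::transform follows the same pattern as std::transform from the C++ standard library, so the same operation can be sketched on the host. This is only a CPU analogue for illustration, not a CUBLAS or GPU solution, and `hadamard` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cassert>
#include <complex>
#include <functional>
#include <vector>

// Host-side analogue of the thrust::transform call: z[i] = x[i] * y[i].
std::vector<std::complex<float>> hadamard(
        const std::vector<std::complex<float>>& x,
        const std::vector<std::complex<float>>& y) {
    assert(x.size() == y.size());
    std::vector<std::complex<float>> z(x.size());
    std::transform(x.begin(), x.end(), y.begin(), z.begin(),
                   std::multiplies<std::complex<float>>());
    return z;
}
```

For example, with x = {1+2i, i} and y = {3+4i, i}, the result is {-5+10i, -1}.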

Calculating (A - B(D^-1)B^T )^-1 with CUDA

What might be the most efficient way of calculating the following expression using CUDA C ?
(A - B(D^-1)B^T )^-1
where D is a very large symmetric matrix and A is a small symmetric matrix, which makes B and B^T medium sized rectangular non-symmetric matrices. Of course (^-1) and (^T) are the inverse and transpose operations, respectively.
If you are comfortable with "low"-level programming, then the matrix inversion could be performed with the CULA or MAGMA libraries.
CULA Dense provides single-precision (real or complex) routines for System Solve, Linear Least Squares Solve, and Constrained Linear Least Squares Solve. CULA Sparse is a collection of iterative solvers for sparse matrices. MAGMA contains dgetrf and dgetri to calculate inverses of square double-precision matrices.
For matrix multiplications, including transpositions, you could use cuBLAS routines.
If you prefer "higher"-level programming, then ArrayFire enables you to perform matrix multiplications, inversions, transposes, solution of linear systems, and elementwise operations with a more natural mathematical syntax. Also, MATLAB has GPU computing support for NVIDIA CUDA-enabled GPUs.
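Whichever library is used, the order of operations can be sketched on the CPU first. The sketch below is an assumption-laden illustration rather than the answer's method: it replaces the explicit inverse of D with a linear solve D X = B^T (a common alternative to forming D^-1), uses plain Gaussian elimination where a library factorization would go, and the names `Matrix`, `solve` and `schur_inverse` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Row-major dense matrix stored flat: M[i * cols + j].
using Matrix = std::vector<double>;

// Solve M * X = R by Gaussian elimination with partial pivoting.
// M is n x n, R is n x k; both are copies, so the caller's data is untouched.
Matrix solve(Matrix M, Matrix R, std::size_t n, std::size_t k) {
    for (std::size_t col = 0; col < n; ++col) {
        std::size_t piv = col;  // row with the largest pivot magnitude
        for (std::size_t r = col + 1; r < n; ++r)
            if (std::fabs(M[r * n + col]) > std::fabs(M[piv * n + col])) piv = r;
        assert(M[piv * n + col] != 0.0 && "matrix is singular");
        if (piv != col) {
            for (std::size_t j = 0; j < n; ++j) std::swap(M[col * n + j], M[piv * n + j]);
            for (std::size_t j = 0; j < k; ++j) std::swap(R[col * k + j], R[piv * k + j]);
        }
        for (std::size_t r = col + 1; r < n; ++r) {  // eliminate below the pivot
            double f = M[r * n + col] / M[col * n + col];
            for (std::size_t j = col; j < n; ++j) M[r * n + j] -= f * M[col * n + j];
            for (std::size_t j = 0; j < k; ++j) R[r * k + j] -= f * R[col * k + j];
        }
    }
    for (std::size_t i = n; i-- > 0;) {  // back substitution, last row first
        for (std::size_t j = 0; j < k; ++j) {
            double s = R[i * k + j];
            for (std::size_t c = i + 1; c < n; ++c) s -= M[i * n + c] * R[c * k + j];
            R[i * k + j] = s / M[i * n + i];
        }
    }
    return R;
}

// (A - B D^-1 B^T)^-1 with A: p x p, B: p x q, D: q x q.
Matrix schur_inverse(const Matrix& A, const Matrix& B, const Matrix& D,
                     std::size_t p, std::size_t q) {
    Matrix Bt(q * p);  // B^T, q x p
    for (std::size_t i = 0; i < p; ++i)
        for (std::size_t j = 0; j < q; ++j) Bt[j * p + i] = B[i * q + j];
    Matrix X = solve(D, Bt, q, p);  // X = D^-1 B^T without forming D^-1
    Matrix S = A;                   // S = A - B X, only p x p
    for (std::size_t i = 0; i < p; ++i)
        for (std::size_t j = 0; j < p; ++j)
            for (std::size_t c = 0; c < q; ++c) S[i * p + j] -= B[i * q + c] * X[c * p + j];
    Matrix I(p * p, 0.0);
    for (std::size_t i = 0; i < p; ++i) I[i * p + i] = 1.0;
    return solve(S, I, p, p);  // explicit inverse of the small matrix only
}
```

The expensive step is the solve with the large matrix D. The final inversion only involves a matrix the size of A, so it is cheap by comparison. On the GPU, the products would map to GEMM calls and the solves to a factorization routine from one of the libraries above.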

CUDA cublas<t>gbmv understanding

I recently wanted to use a simple CUDA matrix-vector multiplication. I found a suitable function in the cublas library: cublas<t>gbmv. Here is the official documentation
But it is actually very poor, so I didn't manage to understand what the kl and ku parameters mean. Moreover, I have no idea what stride is (it must also be provided).
There is a brief explanation of these parameters (Page 37), but it looks like I need to know something else.
A search on the internet doesn't provide tons of useful information on this question, mostly references to different version of documentation.
So I have several questions to GPU/CUDA/cublas gurus:
How do I find more understandable docs or guides about using cublas?
If you know how to use this particular function, could you explain how to use it?
Maybe cublas library is somewhat extraordinary and everyone uses something more popular, better documented and so on?
Thanks a lot.
So BLAS (Basic Linear Algebra Subprograms) generally is an API to, as the name says, basic linear algebra routines. It includes vector-vector operations (level 1 blas routines), matrix-vector operations (level 2) and matrix-matrix operations (level 3). There is a "reference" BLAS available that implements everything correctly, but most of the time you'd use an optimized implementation for your architecture. cuBLAS is an implementation for CUDA.
The BLAS API was so successful as an API that describes the basic operations that it's become very widely adopted. However, (a) the names are incredibly cryptic because of language limitations of the day (this was 1979, and the API was defined using names of 6 characters or fewer, the Fortran identifier limit, to ensure it could compile everywhere), and (b) it is successful because it's quite general, and so even the simplest function calls require a lot of extraneous arguments.
Because it's so widespread, it's often assumed that if you're doing numerical linear algebra, you already know the general gist of the API, so implementation manuals often leave out important details, and I think that's what you're running into.
The Level 2 and 3 routines generally have function names of the form TMMOO.. where T is the numerical type of the matrix/vector (S/D for single/double precision real, C/Z for single/double precision complex), MM is the matrix type (GE for general - eg, just a dense matrix you can't say anything else about; GB for a general banded matrix, SY for symmetric matrices, etc), and OO is the operation.
This all seems slightly ridiculous now, but it worked and works relatively well -- you quickly learn to scan these for familiar operations so that SGEMV is a single-precision general-matrix times vector multiplication (which is probably what you want, not SGBMV), DGEMM is double-precision matrix-matrix multiply, etc. But it does take some practice.
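The naming scheme is mechanical enough to decode in a few lines. The following is a small illustrative sketch covering only the letters mentioned above; `BlasName` and `decode_blas_name` are hypothetical names, not part of any BLAS library.

```cpp
#include <cassert>
#include <string>

struct BlasName {
    std::string type;    // numerical type, from the first letter
    std::string matrix;  // matrix kind, from the next two letters
    std::string op;      // operation, from the remaining letters
};

// Split a Level 2/3 BLAS routine name of the form T MM OO.. into its parts.
BlasName decode_blas_name(const std::string& name) {
    assert(name.size() >= 4);
    BlasName out;
    switch (name[0]) {
        case 'S': out.type = "single-precision real"; break;
        case 'D': out.type = "double-precision real"; break;
        case 'C': out.type = "single-precision complex"; break;
        case 'Z': out.type = "double-precision complex"; break;
        default:  out.type = "unknown"; break;
    }
    const std::string mm = name.substr(1, 2);
    if (mm == "GE")      out.matrix = "general";
    else if (mm == "GB") out.matrix = "general banded";
    else if (mm == "SY") out.matrix = "symmetric";
    else                 out.matrix = "other (" + mm + ")";
    out.op = name.substr(3);
    return out;
}
```

For example, `decode_blas_name("SGEMV")` yields "single-precision real", "general" and "MV", while "SGBMV" differs only in the matrix kind, "general banded".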
So if you look at the cublas sgemv instructions, or in the documentation of the original, you can step through the argument list. First, the basic operation is
This function performs the matrix-vector multiplication
y = α op(A) x + β y
where A is an m × n matrix stored in column-major format, x and y are vectors, and α and β are scalars.
Here op(A) can be A, A^T, or A^H. So if you just want y = Ax, as is the common case, then α = 1, β = 0, and trans == CUBLAS_OP_N.
incx is the stride between consecutive elements of x; there are lots of situations where this comes in handy, but if x is just a simple 1D array containing the vector, then the stride is 1.
And that's about all you need for SGEMV.
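To see how the arguments fit together, here is a minimal host-side reference for the non-transposed case. It is a sketch of the semantics only, not the cuBLAS implementation: `gemv_n_ref` is a hypothetical name, and it assumes positive strides and column-major storage with leading dimension lda.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference y = alpha * A * x + beta * y for op(A) = A.
// A is m x n, column-major, with leading dimension lda >= m.
// incx / incy are the strides between consecutive vector elements.
void gemv_n_ref(std::size_t m, std::size_t n, float alpha,
                const std::vector<float>& A, std::size_t lda,
                const std::vector<float>& x, std::size_t incx,
                float beta, std::vector<float>& y, std::size_t incy) {
    assert(lda >= m && incx >= 1 && incy >= 1);
    for (std::size_t i = 0; i < m; ++i) {
        float acc = 0.0f;
        for (std::size_t j = 0; j < n; ++j) {
            // Element (i, j) lives at A[i + j * lda]; element j of x at x[j * incx].
            acc += A[i + j * lda] * x[j * incx];
        }
        y[i * incy] = alpha * acc + beta * y[i * incy];
    }
}
```

With incx = 2, the routine reads every other entry of the array holding x, which is how a vector can be taken from interleaved data without copying it first.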