Transpose matrix multiplication in cuBLAS howto - cuda

The problem is simple: I have two matrices, A and B, that are M by N, where M >> N. I want to first take the transpose of A, and then multiply that by B (A^T * B) to put that into C, which is N by N. I have everything set up for A and B, but how do I call cublasSgemm properly without it returning the wrong answer?
I understand that cuBlas has a cublasOperation_t enum for transposing things beforehand, but somehow I'm not quite using it correctly. My matrices A and B are in row-major order, i.e. [ row1 ][ row2 ][ row3 ]..... in device memory. That means for A to be interpreted as A-transposed, BLAS needs to know my A is in column-major order. My current code looks like below:
float *A, *B, *C;
// initialize A, B, C as device arrays, fill them with values
// initialize m = num_row_A, n = num_row_B, and k = num_col_A;
// set lda = m, ldb = k, ldc = m;
// alpha = 1, beta = 0;
// set up cuBlas handle ...
cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k, &alpha, A, lda, B, ldb, &beta, C, ldc);
My questions:
Am I setting up m, k, n correctly?
What about lda, ldb, ldc?
Thanks!

Since cuBLAS always assume that the matrices are stored in column-major, you could either transpose your matrices first into colum-major by using cublas_geam(), or you could treat your matrix A, stored in row-major, as a new matrix AT stored in column-major. The matrix AT is actually the transpose of A. For B do the same thing. Then you could calculate matrix C stored in column-major by C=AT * BT^T
float* AT = A;
float* BT = B;
The leading dimension is a param related to the storage, which doesn't change no matter if you use the transpose flag CUBLAS_OP_T or not.
lda = num_col_A = num_row_AT = N;
ldb = num_col_B = num_row_BT = N;
ldc = num_row_C = N;
m and n in the cuBLAS GEMM routine are the #rows and #cols of the result matrix C,
m = num_row_C = num_row_AT = num_col_A = N;
n = num_col_C = num_row_BT = num_col_B = N;
k is the common dimension of A^T and B,
k = num_col_AT = num_row_B = M;
Then you could invoke the GEMM routine by
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, m, n, k, &alpha, AT, lda, BT, ldb, &beta, C, ldc);
If you want the matrix C to be stored in row-major, you could calculate the CT stored in column-major with the formula CT = BT * AT^T by
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, n, m, k, &alpha, BT, ldb, AT, lda, &beta, CT, ldc);
Please note you don't have to swap m and n since C is a square matrix in this case.

Related

Understanding wmma::col_major vs wmma::row_major in tensor cores API

I'm having a hard time understanding this. Let's say that I fill my matrices like this:
A[i] = [a1 a2 a3 a4 a5 a6...]
B[i] = [b1 b2 b3 b4 b5 b6...]
C[i] = [c1 c2 c3 c4 c5 c6...]
And i want them to appear, e.g., matrix_a: 2x3, as
A = | a1 a2 a3 |
| a4 a5 a6 |
To me this is a row_major approach. Now let's move on to tensor cores. We have:
C = A*B, where the matrix dimensions are given by m,n,k (small letters):
matrix_a: mxk
matrix_b: kxn
matrix_c: mxn
We also set the fragments of the matrices with dimensions M,N,K (capital letters). Then in wmma I see the following fragment declarations:
case I
wmma::fragment<wmma::matrix_a, M, N, K, half, wmma::col_major> a_frag;
wmma::fragment<wmma::matrix_b, M, N, K, half, wmma::col_major> b_frag;
But I also have seen
case II
wmma::fragment<wmma::matrix_a, M, N, K, half, wmma::row_major> a_frag;
wmma::fragment<wmma::matrix_b, M, N, K, half, wmma::row_major> b_frag;
and even
case III
wmma::fragment<wmma::matrix_a, M, N, K, half, wmma::row_major> a_frag;
wmma::fragment<wmma::matrix_b, M, N, K, half, wmma::col_major> b_frag;
I really don't get why we need these three cases. To me Case II is enough for solving or not?
To make it simple let's focus on Block(0,0). Then the load_matrix_sync reduces
Case I:
wmma::load_matrix_sync(a_frag, ...i * K * m, m);
wmma::load_matrix_sync(b_frag, ...i * K, k);
Case II:
wmma::load_matrix_sync(a_frag, ...i * K, k);
wmma::load_matrix_sync(b_frag, ...i * K * n, n);
To me it looks like Case I is computing B.T*A.T, because b_frag now is the transpose with dimension NxK and is loaded horizontally, i.e., 0.,K,2K,3K, while a_frag now is the tranpose with dimension KxM and is loaded vertically, i.e., 0, K*m, 2K*m. In contrast, in Case II, we explicitly solve A*B. a_frag with dimension MxK is loaded horizontally (0,K,2K,...) and b_frag with dimension KxN is loaded vertically (0,K*n,2K*n,...).
So why would we need these three cases? Actually I don't even know what Case III is used for. Is it computing also A*B?
Thanks

CUBLAS Sgemm confusing results

For two matrices X and Q of size 4x3 and 2x3
which in memory look like
x = [0 1 2 3 4 5 6 7 8 9 10 11]
q = [3 4 5 6 7 8]
I tried to use cublas multiplication cublasSgemm, but I couldn't manage to get expected results.
Since they are stored in row-major order so they should be interpreted as 3x4 and 3x2 so it seemed for me that
cublasSgemm(cublas_handle,
CUBLAS_OP_T, CUBLAS_OP_N,
q_rows_num, x_rows_num, dim,
&alpha, // 1
q_device, q_rows_num,
x, x_rows_num,
&beta, // 0
x_q_multiplication, q_rows_num);
where
dim = 3
x_rows_num = 4
q_rows_num = 2
would work but in that case I got error
** On entry to SGEMM parameter number 8 had an illegal value
I also tried shuffling parameters a bit but I couldn't find any setup that would work.
So is it possible to multiply them without changing to column-major order?
EDIT:
So I got exepected results with changes made in this working example:
#include <cublas_v2.h>
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>
int main()
{
int x_rows_num = 4;
int q_rows_num = 2;
int dim = 3;
int N = x_rows_num*dim;
int M = q_rows_num*dim;
float *x, *q, *x_q_multiplication;
cudaMallocManaged(&x, N*sizeof(float));
cudaMallocManaged(&q, M*sizeof(float));
cudaMallocManaged(&x_q_multiplication, q_rows_num*x_rows_num*dim);
for (int i = 0; i< N; i++) x[i] = i*1.0f;
for (int i = 0; i< M; i++) q[i] = (i + 3)*1.0f;
float *q_device;
cudaMallocManaged(&q_device, M*sizeof(float));
cudaMemcpy(q_device, q, M*sizeof(float), cudaMemcpyHostToDevice);
cublasHandle_t handle;
cublasCreate(&handle);
float alpha = 1.f;
float beta = 0.f;
cublasSgemm(handle,
CUBLAS_OP_T, CUBLAS_OP_N,
x_rows_num, q_rows_num, dim,
&alpha,
x, dim,
q, dim,
&beta,
x_q_multiplication, x_rows_num);
cudaDeviceSynchronize();
for (int i = 0; i < q_rows_num*x_rows_num; i++) std::cout << x_q_multiplication[i] << " ";
cudaFree(x);
cudaFree(q);
cudaFree(x_q_multiplication);
return 0;
}
However I'am still not sure why dim became leading dimension
Your original CUBLAS call:
cublasSgemm(cublas_handle,
CUBLAS_OP_T, CUBLAS_OP_N,
q_rows_num, x_rows_num, dim,
&alpha, // 1
q_device, q_rows_num,
x, x_rows_num,
&beta, // 0
x_q_multiplication, q_rows_num);
was close to correct. Your interpretation of what the leading dimensions should be was correct. What you got wrong was the Op specifiers. If both matrices are row major ordered and the first array needs to be read in its (row major) transposed order, then the operation should be:
#include <cublas_v2.h>
#include <cstring>
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>
int main()
{
int x_rows_num = 4;
int q_rows_num = 2;
int dim = 3;
int N = x_rows_num*dim;
int M = q_rows_num*dim;
float x0[12] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
float q0[6] = {3, 4, 5, 6, 7, 8 };
float *x, *q, *x_q_multiplication;
cudaMallocManaged(&x, N*sizeof(float));
cudaMallocManaged(&q, M*sizeof(float));
cudaMallocManaged(&x_q_multiplication, q_rows_num*x_rows_num*dim);
std::memcpy(x, x0, N*sizeof(float));
std::memcpy(q, q0, M*sizeof(float));
float *q_device;
cudaMallocManaged(&q_device, M*sizeof(float));
cudaMemcpy(q_device, q, M*sizeof(float), cudaMemcpyHostToDevice);
cublasHandle_t handle;
cublasCreate(&handle);
float alpha = 1.f;
float beta = 0.f;
cublasSgemm(handle,
CUBLAS_OP_N, CUBLAS_OP_T,
q_rows_num, x_rows_num, dim,
&alpha, // 1
q_device, q_rows_num,
x, x_rows_num,
&beta, // 0
x_q_multiplication, q_rows_num);
cudaDeviceSynchronize();
for (int i = 0; i < q_rows_num*x_rows_num; i++) std::cout << x_q_multiplication[i] << " "; std::cout << std::endl;
cudaFree(x);
cudaFree(q);
cudaFree(x_q_multiplication);
return 0;
}
which does this for me:
$ nvcc -arch=sm_52 cublas_trans.cu -o cublas_trans -lcublas
$ ./cublas_trans
76 88 91 106 106 124 121 142
and which I believe is the correct answer.
Incidentally, Robert Crovella's now deleted comment, which you say you take offense to was 100% correct. I suspect he read, as I did, your original CUBLAS call, interpreted the arguments and concluded, as I did, and as CUBLAS itself did, that you are trying to multiply a 3x4 matrix and a 3x2 matrix. Which is why the invalid argument error was raised.

Error "expression must have integral or enum type" in thats code:

Error "expression must have integral or enum type" in thats code:
__global__ void VectorKernel(float *a, float *b, float *c, int n)
{
int i = threadIdx.x;
float y = 0, z = 0;
if (i < n)
y = (b-a) / n;
for (float j = y; j <= n ; j++) {
z = (((j+y) - j) / 6) * function(j) + 4 * (function((j + (y+j)) / 2)) + function(y+j);
c = c + z;
}
}
the error happen in "z", in stretch:
c = c + z;
(i'm beginner in CUDA programming)
c is a pointer. Pointer arithmetic requires a pointer and an integer type expression.
If you want to to add z to the float pointed to by c you should change the expression to:
*c = *c + z;
When you write c = c + z and get an error like this, you should suspect your types are mismatched.
c is a float * and z is a float which are not assignable.
What you probably want to do is store the result of *c + z in the memory location pointed at by c, in which case you'd write:
*c = *c + z.

How to do dot product between matrices in caffe?

In inner product layer, I need to multiply (top_diff * bottom_data) .* (2*weight). First we calculate (result = top_diff * bottom_data) as matrix multiplication in caffe_cpu_gemm and then do a dot product between weight and result.
More explanation is defined as follow:
const Dtype* weight = this->blobs_[0]->cpu_data();
if (this->param_propagate_down_[0]) {
const Dtype* top_diff = top[0]->cpu_diff();
const Dtype* bottom_data = bottom[0]->cpu_data();
caffe_cpu_gemm<Dtype>(CblasTrans, CblasNoTrans, N_, K_, M_, (Dtype)1.,
top_diff, bottom_data, (Dtype)1., this->blobs_[0]->mutable_cpu_diff());
}
For more understanding, I checked math_function.c. It is implemented as follows:
template<>
void caffe_cpu_gemm<float>(const CBLAS_TRANSPOSE TransA,
const CBLAS_TRANSPOSE TransB, const int M, const int N, const int K,
const float alpha, const float* A, const float* B, const float beta,
float* C) {
int lda = (TransA == CblasNoTrans) ? K : M;
int ldb = (TransB == CblasNoTrans) ? N : K;
cblas_sgemm(CblasRowMajor, TransA, TransB, M, N, K, alpha, A, lda, B,
ldb, beta, C, N);
}
I think I should perform multiplication (result = top_diff * bottom_data) in caffe_cpu_gemm() and after that do dot product with weight. how should I do?!
Many thanks!!!! Any advice would be appreciated!
If you just want to perform dot product between two matrices, you can use the following function to multiply matrices on CPU,
void caffe_mul<float>(const int n, const float* a, const float* b, float* y)
If you want to do the same operation on a GPU, use this template
void caffe_gpu_mul<float>(const int N, const float* a, const float* b, float* y)
a and b are you matrices and c will contain the final result. N is the total number of elements in your matrix.
You can also use the 'Eltwise' layer, which already does this.

Covariance calculation with CUDA

I am implementing Principal Component Analysis (PCA) based face recognition using CUDA. I used orl face database and calculated the mean image and normalized images. I'm facing a problem in calculating the covariance matrix.
__global__ void mean(int* i_data, int num, int size, int* o_data, int WIDTH, int HEIGHT, int* normalized)
{
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
int idx = x + y * WIDTH;
int r = 0;
int idx_z=0;
for (int z = 0; z < num; ++z)
{
idx_z = z * WIDTH*HEIGHT + idx;
r += i_data[ idx_z ];
}
o_data[ idx ] = int(r/num);
for (int z = 0; z < num; ++z)
{
idx_z = z * WIDTH*HEIGHT + idx;
normalized[idx_z] = abs(i_data[idx_z] - o_data[idx]);
}
}
dim3 dimBlock = dim3(8,4,1);
dim3 dimGrid = dim3(ceil(rows/dimBlock.x) , ceil(cols/dimBlock.y));
mean<<<dimGrid,dimBlock>>>(dev_images, IMAGE_NUM,size,dev_output,rows,cols,dev_normalized);
The database images are of size (92,112).
Your code does not make any sense to me.
Covariance calculation in CUDA can be easily performed by using cuBLAS in conjunction with Thrust. Considering N realizations of K random variables, the covariance estimation formula is the following
where qjk, j,k=1,...,K are the covariance estimate values, Xj and Xk with the overbars are the random variable means as estimated from the available realizations.
Below, I'm reporting a fully worked example:
#include <cublas_v2.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/random.h>
#include <thrust/sequence.h>
#include <stdio.h>
#include <iostream>
#include "Utilities.cuh"
#include "TimingGPU.cuh"
/*************************************/
/* CONVERT LINEAR INDEX TO ROW INDEX */
/*************************************/
template <typename T>
struct linear_index_to_row_index : public thrust::unary_function<T,T> {
T Ncols; // --- Number of columns
__host__ __device__ linear_index_to_row_index(T Ncols) : Ncols(Ncols) {}
__host__ __device__ T operator()(T i) { return i / Ncols; }
};
/********/
/* MAIN */
/********/
int main()
{
const int Nsamples = 3; // --- Number of realizations for each random variable (number of rows of the X matrix)
const int NX = 4; // --- Number of random variables (number of columns of the X matrix)
// --- Random uniform integer distribution between 10 and 99
thrust::default_random_engine rng;
thrust::uniform_int_distribution<int> dist(10, 99);
// --- Matrix allocation and initialization
thrust::device_vector<float> d_X(Nsamples * NX);
for (size_t i = 0; i < d_X.size(); i++) d_X[i] = (float)dist(rng);
// --- cuBLAS handle creation
cublasHandle_t handle;
cublasSafeCall(cublasCreate(&handle));
/*************************************************/
/* CALCULATING THE MEANS OF THE RANDOM VARIABLES */
/*************************************************/
// --- Array containing the means multiplied by Nsamples
thrust::device_vector<float> d_means(NX);
thrust::device_vector<float> d_ones(Nsamples, 1.f);
float alpha = 1.f / (float)Nsamples;
float beta = 0.f;
cublasSafeCall(cublasSgemv(handle, CUBLAS_OP_T, Nsamples, NX, &alpha, thrust::raw_pointer_cast(d_X.data()), Nsamples,
thrust::raw_pointer_cast(d_ones.data()), 1, &beta, thrust::raw_pointer_cast(d_means.data()), 1));
/**********************************************/
/* SUBTRACTING THE MEANS FROM THE MATRIX ROWS */
/**********************************************/
thrust::transform(
d_X.begin(), d_X.end(),
thrust::make_permutation_iterator(
d_means.begin(),
thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Nsamples))),
d_X.begin(),
thrust::minus<float>());
/*************************************/
/* CALCULATING THE COVARIANCE MATRIX */
/*************************************/
thrust::device_vector<float> d_cov(NX * NX);
alpha = 1.f;
cublasSafeCall(cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, NX, NX, Nsamples, &alpha,
thrust::raw_pointer_cast(d_X.data()), Nsamples, thrust::raw_pointer_cast(d_X.data()), Nsamples, &beta,
thrust::raw_pointer_cast(d_cov.data()), NX));
// --- Final normalization by Nsamples - 1
thrust::transform(
d_cov.begin(), d_cov.end(),
thrust::make_constant_iterator((float)(Nsamples-1)),
d_cov.begin(),
thrust::divides<float>());
for(int i = 0; i < NX * NX; i++) std::cout << d_cov[i] << "\n";
return 0;
}
I implemented covariance calculator with CUBlas and Cuda Thrust and compared with online co variance calculation tools. It seems mine producing good results. The code below planned to QDA Bayes. So matrix given may contain more than one class. So multiple co variance matrices is calculated. I hope it will be useful for someone.
//! Calculates one or more than one coVarianceMatrix given data.
// There can be many classes since many covariance matrixes.
/*!
\param inMatrix This vector contains matrix data in major storage.
Forexample if inMatrix=[1 2 3 4 5 6] and trialSizes=[2] this means matrix we will work on a matrix like :
|1 4 |
|2 5 |
|3 6 | -> 2 Trials, 3 Features. Columns contains feature rows contains trials (samples)
\param trialSizes There can be many classes since many covariance matrixes. Samples from all classes will be given with inMatrix.
But we need to know how many trials(samples) we have for each class.
For example if inMatrix=[1 2 3 4 5 6 7 8 9 10 11 12] and trialSizes=[2,2]
this means matrix we will work on a matrix like :
|1 4 | |7 10 |
|2 5 | |8 11 |
|3 6 | |9 12 | --> Total number of trials(samples which is total rowCount) 2 + 2 = 4 ,
So colSize = inMatrix.size()/4 = 3(feature vector size)
--> There is two element in trialSize vec so each vector has to samples
*/
void multiQDACovianceCalculator(std::vector<float>& inMatrix, std::vector<int>& trialSizes)
{
cublasHandle_t handle; // CUBLAS context
int classCount = trialSizes.size();
int rowSize = std::accumulate(trialSizes.begin(), trialSizes.end(), 0);
int dimensionSize = inMatrix.size() / rowSize;
float alpha = 1.0f;
float beta = 0.0f; // bet =1
thrust::device_vector<float> d_cov1(dimensionSize * dimensionSize);
thrust::device_vector<float> d_cov2(dimensionSize * dimensionSize);
thrust::device_vector<float> d_covResult(dimensionSize * dimensionSize);
thrust::device_vector<float> d_wholeMatrix(inMatrix);
thrust::device_vector<float> d_meansVec(dimensionSize); // rowVec of means of trials
float *meanVecPtr = thrust::raw_pointer_cast(d_meansVec.data());
float *device2DMatrixPtr = thrust::raw_pointer_cast(d_wholeMatrix.data());
auto maxTrialNumber = *std::max_element(trialSizes.begin(), trialSizes.end());
thrust::device_vector<float> deviceVector(maxTrialNumber, 1.0f);
cublasCreate(&handle);
// Inside of for loop one covariance matrix calculated each time
for (int i = 0; i < trialSizes.size(); i++)
{
// X*transpose(X) / N
alpha = 1.0f / trialSizes[i];
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, dimensionSize, dimensionSize, trialSizes[i], &alpha,
device2DMatrixPtr, dimensionSize, device2DMatrixPtr, dimensionSize, &beta,
thrust::raw_pointer_cast(d_cov1.data()), dimensionSize);
// Mean vector of each column
alpha = 1.0f;
cublasSgemv(handle, CUBLAS_OP_N, dimensionSize, trialSizes[i], &alpha, device2DMatrixPtr,
dimensionSize, thrust::raw_pointer_cast(deviceVector.data()), 1, &beta, meanVecPtr, 1);
// MeanVec * transpose(MeanVec) / N*N
alpha = 1.0f / (trialSizes[i] * trialSizes[i]);
cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, dimensionSize, dimensionSize, 1, &alpha,
meanVecPtr, 1, meanVecPtr, 1, &beta,
thrust::raw_pointer_cast(d_cov2.data()), dimensionSize);
alpha = 1.0f;
beta = -1.0f;
// (X*transpose(X) / N) - (MeanVec * transpose(MeanVec) / N*N)
cublasSgeam(handle, CUBLAS_OP_N, CUBLAS_OP_N, dimensionSize, dimensionSize, &alpha,
thrust::raw_pointer_cast(d_cov1.data()), dimensionSize, &beta, thrust::raw_pointer_cast(d_cov2.data()),
dimensionSize, thrust::raw_pointer_cast(d_covResult.data()), dimensionSize);
// Go to other class and calculate its covarianceMatrix
device2DMatrixPtr += trialSizes[i] * dimensionSize;
}
printVector(d_covResult);
cublasDestroy(handle);
}