CUBLAS Sgemm confusing results - cuda

For two matrices X and Q of size 4x3 and 2x3
which in memory look like
x = [0 1 2 3 4 5 6 7 8 9 10 11]
q = [3 4 5 6 7 8]
I tried to multiply them with cublasSgemm, but I couldn't get the expected results.
Since they are stored in row-major order, they should be interpreted as 3x4 and 3x2, so it seemed to me that
cublasSgemm(cublas_handle,
CUBLAS_OP_T, CUBLAS_OP_N,
q_rows_num, x_rows_num, dim,
&alpha, // 1
q_device, q_rows_num,
x, x_rows_num,
&beta, // 0
x_q_multiplication, q_rows_num);
where
dim = 3
x_rows_num = 4
q_rows_num = 2
would work, but in that case I got the error
** On entry to SGEMM parameter number 8 had an illegal value
I also tried shuffling parameters a bit but I couldn't find any setup that would work.
So is it possible to multiply them without changing to column-major order?
EDIT:
So I got the expected results with the changes made in this working example:
#include <cublas_v2.h>
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>
int main()
{
int x_rows_num = 4;
int q_rows_num = 2;
int dim = 3;
int N = x_rows_num*dim;
int M = q_rows_num*dim;
float *x, *q, *x_q_multiplication;
cudaMallocManaged(&x, N*sizeof(float));
cudaMallocManaged(&q, M*sizeof(float));
cudaMallocManaged(&x_q_multiplication, q_rows_num*x_rows_num*sizeof(float));
for (int i = 0; i< N; i++) x[i] = i*1.0f;
for (int i = 0; i< M; i++) q[i] = (i + 3)*1.0f;
float *q_device;
cudaMallocManaged(&q_device, M*sizeof(float));
cudaMemcpy(q_device, q, M*sizeof(float), cudaMemcpyHostToDevice);
cublasHandle_t handle;
cublasCreate(&handle);
float alpha = 1.f;
float beta = 0.f;
cublasSgemm(handle,
CUBLAS_OP_T, CUBLAS_OP_N,
x_rows_num, q_rows_num, dim,
&alpha,
x, dim,
q, dim,
&beta,
x_q_multiplication, x_rows_num);
cudaDeviceSynchronize();
for (int i = 0; i < q_rows_num*x_rows_num; i++) std::cout << x_q_multiplication[i] << " ";
cudaFree(x);
cudaFree(q);
cudaFree(x_q_multiplication);
return 0;
}
However, I'm still not sure why dim became the leading dimension.

Your original CUBLAS call:
cublasSgemm(cublas_handle,
CUBLAS_OP_T, CUBLAS_OP_N,
q_rows_num, x_rows_num, dim,
&alpha, // 1
q_device, q_rows_num,
x, x_rows_num,
&beta, // 0
x_q_multiplication, q_rows_num);
was close to correct. Your interpretation of what the leading dimensions should be was correct. What you got wrong was the Op specifiers. If both matrices are row major ordered and the first array needs to be read in its (row major) transposed order, then the operation should be:
#include <cublas_v2.h>
#include <cstring>
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>
int main()
{
int x_rows_num = 4;
int q_rows_num = 2;
int dim = 3;
int N = x_rows_num*dim;
int M = q_rows_num*dim;
float x0[12] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
float q0[6] = {3, 4, 5, 6, 7, 8 };
float *x, *q, *x_q_multiplication;
cudaMallocManaged(&x, N*sizeof(float));
cudaMallocManaged(&q, M*sizeof(float));
cudaMallocManaged(&x_q_multiplication, q_rows_num*x_rows_num*sizeof(float));
std::memcpy(x, x0, N*sizeof(float));
std::memcpy(q, q0, M*sizeof(float));
float *q_device;
cudaMallocManaged(&q_device, M*sizeof(float));
cudaMemcpy(q_device, q, M*sizeof(float), cudaMemcpyHostToDevice);
cublasHandle_t handle;
cublasCreate(&handle);
float alpha = 1.f;
float beta = 0.f;
cublasSgemm(handle,
CUBLAS_OP_N, CUBLAS_OP_T,
q_rows_num, x_rows_num, dim,
&alpha, // 1
q_device, q_rows_num,
x, x_rows_num,
&beta, // 0
x_q_multiplication, q_rows_num);
cudaDeviceSynchronize();
for (int i = 0; i < q_rows_num*x_rows_num; i++) std::cout << x_q_multiplication[i] << " "; std::cout << std::endl;
cudaFree(x);
cudaFree(q);
cudaFree(x_q_multiplication);
return 0;
}
which does this for me:
$ nvcc -arch=sm_52 cublas_trans.cu -o cublas_trans -lcublas
$ ./cublas_trans
76 88 91 106 106 124 121 142
and which I believe is the correct answer.
Incidentally, Robert Crovella's now deleted comment, which you say you take offense to, was 100% correct. I suspect he read your original CUBLAS call, as I did, interpreted the arguments and concluded, as I did and as CUBLAS itself did, that you are trying to multiply a 3x4 matrix and a 3x2 matrix, which is why the invalid argument error was raised.
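The leading-dimension rule behind the "parameter number 8" message can be checked on the CPU. The following is only a sketch, not cuBLAS code: gemm_bad_ld and ref_sgemm are hypothetical helper names, alpha = 1 and beta = 0 are hard-coded, and it assumes the error message uses the reference BLAS argument numbering for SGEMM (8 = lda, 10 = ldb, 13 = ldc).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column-major BLAS rule: a leading dimension must be at least the number of
// rows of the matrix as stored, i.e. before op() is applied.
// Returns 0 if lda/ldb/ldc are acceptable, otherwise the SGEMM argument
// number of the first bad one (assumed reference BLAS numbering).
int gemm_bad_ld(bool transA, bool transB, int m, int n, int k,
                int lda, int ldb, int ldc)
{
    int rowsA = transA ? k : m;   // stored A is k x m when transposed
    int rowsB = transB ? n : k;   // stored B is n x k when transposed
    if (lda < rowsA) return 8;
    if (ldb < rowsB) return 10;
    if (ldc < m)     return 13;
    return 0;
}

// Reference C = op(A) * op(B), everything column-major, alpha = 1, beta = 0.
std::vector<float> ref_sgemm(bool transA, bool transB, int m, int n, int k,
                             const std::vector<float>& A, int lda,
                             const std::vector<float>& B, int ldb, int ldc)
{
    assert(gemm_bad_ld(transA, transB, m, n, k, lda, ldb, ldc) == 0);
    std::vector<float> C(static_cast<std::size_t>(ldc) * n, 0.0f);
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                // op(A)(i,p) and op(B)(p,j), read from column-major storage
                float a = transA ? A[p + i * lda] : A[i + p * lda];
                float b = transB ? B[j + p * ldb] : B[p + j * ldb];
                acc += a * b;
            }
            C[i + j * ldc] = acc;
        }
    return C;
}
```

With the question's original arguments (transposed A, m = 2, n = 4, k = 3, lda = 2) the check returns 8, because a transposed A is stored as a 3x2 matrix and needs lda >= 3. Called with the arguments of the corrected call above, ref_sgemm reproduces 76 88 91 106 106 124 121 142. Called with the arguments from the question's EDIT, it reads the row-major 4x3 buffer x as a column-major 3x4 matrix, whose leading dimension is 3, which is why dim appears as the leading dimension there.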

Related

Why PyCUDA is faster than C CUDA in this example

I am exploring to move from OpenCL to CUDA, and did a few tests to benchmark the speed of CUDA in various implementations. To my surprise, in the examples below, the PyCUDA implementation is about 20% faster than the C CUDA example.
I read many posts talking about "release build" of C CUDA code. I did try having -Xptxas -O3 in the makefile and that really did not make a difference. I also tried to adjust the block size, with which the kernel was executed. Unfortunately, it did not help improve the speed, either.
My questions here are:
What could be the reasons leading to the speed difference between C CUDA and PyCUDA?
If the "advanced" (for lack of a better word) compiling in PyCUDA is one of the reasons, how can I optimize the compiling of my C CUDA code?
Are there any other ways to improve the speed of C CUDA in this case?
While I appreciate general comments, I am looking for actionable suggestions that I can validate on my machine. Thanks!
import pycuda.autoinit
import pycuda.driver as drv
import numpy as np
from pycuda.compiler import SourceModule
import time
mod = SourceModule(
    """
__global__ void saxpy(int n, const float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n){
        y[i] = a * x[i] + y[i];
    }
}
"""
)
saxpy = mod.get_function("saxpy")
N = 1 << 25
time_elapse = 0.0
for i in range(100):
    # print(i)
    # print(N)
    x = np.ones(N).astype(np.float32)
    y = 2 * np.ones(N).astype(np.float32)
    start = time.time()
    saxpy(
        np.int32(N),
        np.float32(2.0),
        drv.In(x),
        drv.InOut(y),
        block=(512, 1, 1),
        grid=(int(N / 512) + 1, 1),
    )
    time_elapse += (time.time() - start)
print(time_elapse )
print(y[-100:-1])
print(y.sum())
print(N * 4.0)
#include <stdio.h>
#include <time.h>
#define DIM 512
__global__ void saxpy(int n, float a, float *x, float *y)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n)
y[i] = a * x[i] + y[i];
}
int main(int num_iterations)
{
double start;
double cputime;
int N = 1 << 25;
float *x, *y, *d_x, *d_y;
int i, j;
for (j = 0; j < num_iterations; j++)
{
x = (float *)malloc(N * sizeof(float));
y = (float *)malloc(N * sizeof(float));
cudaMalloc(&d_x, N * sizeof(float));
cudaMalloc(&d_y, N * sizeof(float));
for (i = 0; i < N; i++)
{
x[i] = 1.0f;
y[i] = 2.0f;
}
cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N * sizeof(float), cudaMemcpyHostToDevice);
// Perform SAXPY on 1M elements
start = clock();
saxpy<<<(N + DIM) / DIM, DIM>>>(N, 2.0f, d_x, d_y);
cputime += ((double)(clock() - start) / CLOCKS_PER_SEC);
cudaMemcpy(y, d_y, N * sizeof(float), cudaMemcpyDeviceToHost);
// float maxError = 0.0f;
// for (int i = 0; i < N; i++){
// maxError = max(maxError, abs(y[i] - 4.0f));
// //printf("y[%d]: %f\n", i,y[i]);
// }
// printf("Max error: %f\n", maxError);
cudaFree(d_x);
cudaFree(d_y);
free(x);
free(y);
}
printf("cpu time is %f\n", cputime);
return 0;
}
I saved the above file as cuda_example.cu and compiled it with the following command in a makefile:
nvcc -arch=sm_61 -Xptxas -O3,-v -o main cuda_example.cu
If I execute your CUDA-C code as is, and set num_iterations to 300 like this:
int num_iterations =300;
then the execution of your program takes about 60s on a Geforce GTX 1650. Your code is extremely inefficient, as you allocate memory and copy data back and forth between host and device at every iteration.
So, let's restrict the loop to just the kernel execution:
#include <stdio.h>
#include <time.h>
#define DIM 512
__global__ void saxpy(int n, float a, float *x, float *y)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n)
y[i] = a * x[i] + y[i];
}
int main()
{
double start = clock();
int N = 1 << 25;
float *x, *y, *d_x, *d_y;
int i, j;
int num_iterations = 300;
x = (float *)malloc(N * sizeof(float));
y = (float *)malloc(N * sizeof(float));
cudaMalloc(&d_x, N * sizeof(float));
cudaMalloc(&d_y, N * sizeof(float));
for (i = 0; i < N; i++)
{
x[i] = 1.0f;
y[i] = 2.0f;
}
cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N * sizeof(float), cudaMemcpyHostToDevice);
for (j = 0; j < num_iterations; j++){
saxpy<<<(N + DIM) / DIM, DIM>>>(N, 2.0f, d_x, d_y);
cudaDeviceSynchronize();
}
cudaMemcpy(y, d_y, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_x);
cudaFree(d_y);
free(x);
free(y);
double cputime = ((double)(clock() - start) / CLOCKS_PER_SEC);
printf("cpu time is %f\n", cputime);
return 0;
}
If I do that, then the execution time becomes 1.36 seconds. Doing something similar to the PyCUDA code, I got about 19s of execution time.

cublasDtrsm after LU with pivoting

I am stuck on a small problem. I have to solve a linear system A * x = b.
The matrix A is decomposed by an LU factorization (LAPACK). As a result I get the factorized matrix and the pivot array. After that I want to solve the two linear systems L * y = b and U * x = y on the GPU with *cublasDtrsm*. But because of the row interchanges done by dgetrf in LAPACK, I would have to pass the pivot array to cublas, and the *cublasDtrsm* function doesn't offer anything for this. Without the pivot array I get wrong results.
I already looked into disabling pivoting in LAPACK, but for stability reasons that is not possible. Is there any hint on how to solve a linear equation system with LU factorization?
If you wanted to use this particular approach (cublas trsm after LAPACK getrf), I believe you should be able to use cublas trsm with the L,U output of LAPACK by rearranging your b vector (or matrix) to match the rearrangement order that LAPACK performed during pivoting. I believe this order is given in the formula for ipiv in the LAPACK documentation:
IPIV
IPIV is INTEGER array, dimension (min(M,N))
The pivot indices; for 1 <= i <= min(M,N), row i of the
matrix was interchanged with row IPIV(i).
Here's a sample code that demonstrates the idea for a simple 3x3 test case with a single RHS vector:
$ cat t853.cu
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cublas_v2.h>
#define cudacall(call) \
do \
{ \
cudaError_t err = (call); \
if(cudaSuccess != err) \
{ \
fprintf(stderr,"CUDA Error:\nFile = %s\nLine = %d\nReason = %s\n", __FILE__, __LINE__, cudaGetErrorString(err)); \
cudaDeviceReset(); \
exit(EXIT_FAILURE); \
} \
} \
while (0)
#define cublascall(call) \
do \
{ \
cublasStatus_t status = (call); \
if(CUBLAS_STATUS_SUCCESS != status) \
{ \
fprintf(stderr,"CUBLAS Error:\nFile = %s\nLine = %d\nCode = %d\n", __FILE__, __LINE__, status); \
cudaDeviceReset(); \
exit(EXIT_FAILURE); \
} \
\
} \
while(0)
void LU_device(float *src_d, int n, int *pivot)
{
cublasHandle_t handle;
cublascall(cublasCreate_v2(&handle));
int batchSize = 1;
int *P, *INFO;
cudacall(cudaMalloc<int>(&P,n * batchSize * sizeof(int)));
cudacall(cudaMalloc<int>(&INFO,batchSize * sizeof(int)));
int lda = n;
float *A[] = { src_d };
float **A_d;
cudacall(cudaMalloc<float*>(&A_d,sizeof(A)));
cudacall(cudaMemcpy(A_d,A,sizeof(A),cudaMemcpyHostToDevice));
cublascall(cublasSgetrfBatched(handle,n,A_d,lda,P,INFO,batchSize));
int INFOh = 0;
cudacall(cudaMemcpy(&INFOh,INFO,sizeof(int),cudaMemcpyDeviceToHost));
cudacall(cudaMemcpy(pivot,P,n*batchSize*sizeof(int),cudaMemcpyDeviceToHost));
#ifdef DEBUG_PRINT
for (int qq = 0; qq < n*batchSize; qq++) {printf("pivot[%d] = %d\n", qq, pivot[qq]); }
#endif
if(INFOh > 0) // info = i > 0 means U(i,i) is exactly zero
{
fprintf(stderr, "Factorization Failed: Matrix is singular\n");
cudaDeviceReset();
exit(EXIT_FAILURE);
}
cudaFree(P); cudaFree(INFO); cudaFree(A_d); cublasDestroy(handle);
}
void LU(float* src, float* L, float *U, int n, int *pivot)
{
float *src_d;
cudacall(cudaMalloc<float>(&src_d, n*n * sizeof(float)));
cudacall(cudaMemcpy(src_d,src,n*n * sizeof(float),cudaMemcpyHostToDevice));
LU_device(src_d,n,pivot);
cudacall(cudaMemcpy(L,src_d,n * n * sizeof(float),cudaMemcpyDeviceToHost));
cudacall(cudaMemcpy(U,src_d,n * n * sizeof(float),cudaMemcpyDeviceToHost));
for (int i = 0; i < n; i ++){
for (int j = 0; j < i; j++) L[i*n+j] = 0.0;
for (int j = i+1; j < n; j++) U[i*n+j] = 0.0;}
cudaFree(src_d);
}
void rearrange(float *vec, int *pivot, int n, int dir){
#define DIR_FORWARD 0
#define DIR_REVERSE 1
#define SWAP(x,y) {float swaptmp=(*(y)); (*(y))=(*(x)); (*(x))=swaptmp;}
if (dir == DIR_FORWARD)
for (int i = 0; i < n; i++) SWAP((vec+i),(vec+pivot[i]-1))
else
for (int i = n-1; i >= 0; i--) SWAP((vec+i),(vec+pivot[i]-1))
}
void TRSM(float *A, float *x, float *b, int n, cublasFillMode_t uplo, cublasDiagType_t diagt ){
cublasHandle_t handle;
cublascall(cublasCreate_v2(&handle));
float *A_d, *b_d;
cudacall(cudaMalloc<float>(&A_d, n*n * sizeof(float)));
cudacall(cudaMalloc<float>(&b_d, n * sizeof(float)));
cudacall(cudaMemcpy(b_d, b, n*sizeof(float), cudaMemcpyHostToDevice));
cudacall(cudaMemcpy(A_d, A, n*n*sizeof(float), cudaMemcpyHostToDevice));
const float alpha = 1.0f;
cublascall(cublasStrsm(handle, CUBLAS_SIDE_LEFT, uplo, CUBLAS_OP_N, diagt, n, 1, &alpha, A_d, n, b_d, n));
cudacall(cudaMemcpy(x, b_d, n*sizeof(float), cudaMemcpyDeviceToHost));
cudaFree(A_d); cudaFree(b_d); cublasDestroy(handle);
}
void test_solve()
{
// solve Ax=b
// 1. Perform LU on A
// 2. using pivot sequence, rearrange b -> b'
// 3. perform TRSM on Ly=b'
// 4. perform TRSM on Ux=y
// A = |0 1 4 |
// |3 3 9 |
// |4 10 16|
// x = |1|
// |2|
// |3|
// b = |14|
// |36|
// |72|
const int n = 3;
// has 3,2,3 pivot order
float A_col_major[n*n] = { 0, 3, 4,
1, 3, 10,
4, 9, 16 };
float b1[n] = {14, 36, 72};
/* another example - has 3,3,3 pivot order
float A_transpose[n*n] = { 0, 1, 4,
3, 3, 9,
4, 10, 16 };
float b2[n] = {18, 37, 70};
*/
float result_x[n];
int pivot[n];
float L[n*n];
float U[n*n];
float y[n];
//Select matrix by setting "a"
float *a = A_col_major;
float *b = b1;
printf("Input:\n\n");
for(int i=0; i<n; i++)
{
for(int j=0; j<n; j++)
printf("%f\t",a[i*n+j]);
printf("\n");
}
printf("\n\n");
// 1. LU on A
LU(a,L,U,n,pivot);
#ifdef DEBUG_PRINT
printf("L:\n\n");
for(int i=0; i<n; i++)
{
for(int j=0; j<n; j++)
printf("%f\t",L[i*n+j]);
printf("\n");
}
printf("\n\n");
printf("U:\n\n");
for(int i=0; i<n; i++)
{
for(int j=0; j<n; j++)
printf("%f\t",U[i*n+j]);
printf("\n");
}
printf("\n\n");
#endif
// 2. Rearrange b
rearrange(b,pivot,n,DIR_FORWARD);
#ifdef DEBUG_PRINT
for (int i = 0; i < n; i++) printf("b'[%d] = %f\n", i, b[i]);
#endif
// 3. TRSM on Ly=b
TRSM(L, y, b, n, CUBLAS_FILL_MODE_LOWER, CUBLAS_DIAG_UNIT);
// 4. TRSM on Ux=y
TRSM(U, result_x, y, n, CUBLAS_FILL_MODE_UPPER, CUBLAS_DIAG_NON_UNIT);
fprintf(stdout, "Solution:\n\n");
for(int i=0; i<n; i++)
{
printf("%f\n",result_x[i]);
}
}
int main()
{
test_solve();
return 0;
}
$ nvcc -o t853 t853.cu -lcublas
$ ./t853
Input:
0.000000 3.000000 4.000000
1.000000 3.000000 10.000000
4.000000 9.000000 16.000000
Solution:
1.000000
2.000000
3.000000
$
Note that for this simple test case I used cublas getrfBatched to do the matrix LU factorization, rather than LAPACK, but I think it should behave similarly to LAPACK.
Also note that I'm not intending to comment on the "best approaches for linear system solutions" but merely to explain how the approach you mapped out might be made to work.
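The order of operations (factor, replay the row swaps on b, then two triangular solves) can be cross-checked on the CPU. This is a minimal sketch under stated assumptions, not the code used above: lu_factor and lu_solve are hypothetical helpers, the matrix is column-major and in double precision, and the pivot indices are 1-based as in the IPIV description.

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// In-place LU factorization with partial pivoting of a column-major n x n
// matrix. ipiv is 1-based, LAPACK style: row i was swapped with row ipiv[i].
void lu_factor(std::vector<double>& A, int n, std::vector<int>& ipiv)
{
    ipiv.assign(n, 0);
    for (int k = 0; k < n; ++k) {
        int p = k;                                   // find the pivot row
        for (int i = k + 1; i < n; ++i)
            if (std::fabs(A[i + k * n]) > std::fabs(A[p + k * n])) p = i;
        assert(A[p + k * n] != 0.0);                 // singular otherwise
        ipiv[k] = p + 1;
        for (int j = 0; j < n; ++j)                  // swap whole rows k and p
            std::swap(A[k + j * n], A[p + j * n]);
        for (int i = k + 1; i < n; ++i) {
            A[i + k * n] /= A[k + k * n];            // multiplier, stored in L
            for (int j = k + 1; j < n; ++j)
                A[i + j * n] -= A[i + k * n] * A[k + j * n];
        }
    }
}

// Solve A x = b from the packed LU factors: permute b, then L y = b', U x = y.
std::vector<double> lu_solve(const std::vector<double>& LU, int n,
                             const std::vector<int>& ipiv,
                             std::vector<double> b)
{
    for (int i = 0; i < n; ++i)                      // same swaps, same order
        std::swap(b[i], b[ipiv[i] - 1]);
    for (int i = 0; i < n; ++i)                      // forward, unit diagonal
        for (int j = 0; j < i; ++j) b[i] -= LU[i + j * n] * b[j];
    for (int i = n - 1; i >= 0; --i) {               // backward substitution
        for (int j = i + 1; j < n; ++j) b[i] -= LU[i + j * n] * b[j];
        b[i] /= LU[i + i * n];
    }
    return b;
}
```

For the 3x3 test matrix above (column-major 0, 3, 4, 1, 3, 10, 4, 9, 16 with b = 14, 36, 72) this produces the pivot sequence 3, 2, 3 and the solution 1, 2, 3 up to rounding.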
For permutation on the GPU, a permutation matrix can be created out of the given vector and multiplied with B on the GPU. The permutation vector from LAPACK is in fact meant as a sequential order of swapping steps, so once the for-loop has passed the n-th row, that row will never be touched again. Hence a small algorithm creates a permutation matrix P out of the vector from *<T>getrf*. With that, the linear system L * U * X = P * B is solved, which leads to the correct results.
void
permutationMatrix ( int const rows, //number of rows of A
int const cols, //number of cols of A
int* permArray, //permutation vector from LAPACK
double* permMatrix) //Memory for permutation matrix
{
int tempPerm [rows]; //holds where the ones later shall be in the Matrix
int swap; //variable for swapping
memset(permMatrix,0, rows * cols * sizeof(double)); //fill permutation Matrix with 0s
memset(tempPerm,0, rows * sizeof(int)); //fill temporary memory with 0s
for (int row = 0; row < rows; row ++)
{
//start value for each temp field is the row-number
if (tempPerm [row] == 0)
{
tempPerm [row] = row + 1;
}
/* rows need to be swapped if rownumber != number
* in permutation vector of LAPACK*/
if (permArray[row] != row + 1)
{
//swap with a line which hasn't already swapped
if (tempPerm[permArray[row]-1] == 0)
{
tempPerm[permArray[row]-1] = tempPerm[row];
tempPerm[row] = permArray[row];
}else{
//swap with an already touched line
swap = tempPerm[permArray[row]-1];
tempPerm[permArray[row]-1] = tempPerm[row];
tempPerm[row] = swap;
}
}
//put the one in place in the permutation matrix
permMatrix[row + (tempPerm[row]-1) * rows] = 1.0;
}
}
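An alternative way to obtain the same permutation, shown here only as a sketch, is to replay the swaps on an identity index vector and then scatter the ones. The names ipiv_to_perm and perm_matrix are hypothetical; the pivot indices are assumed to be 1-based and the matrix column-major, as in the function above.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Turn LAPACK-style 1-based pivot indices into a permutation vector:
// row i of P * B is row perm[i] of B.
std::vector<int> ipiv_to_perm(const std::vector<int>& ipiv, int rows)
{
    std::vector<int> perm(rows);
    std::iota(perm.begin(), perm.end(), 0);          // identity permutation
    for (std::size_t i = 0; i < ipiv.size(); ++i) {
        assert(ipiv[i] >= 1 && ipiv[i] <= rows);
        std::swap(perm[i], perm[ipiv[i] - 1]);       // replay swap number i
    }
    return perm;
}

// Dense column-major rows x rows permutation matrix with P(i, perm[i]) = 1.
std::vector<double> perm_matrix(const std::vector<int>& perm)
{
    const std::size_t rows = perm.size();
    std::vector<double> P(rows * rows, 0.0);
    for (std::size_t i = 0; i < rows; ++i)
        P[i + static_cast<std::size_t>(perm[i]) * rows] = 1.0;
    return P;
}
```

For the pivot vector 3, 3, 3 this gives perm = 2, 0, 1 (0-based), which places the ones in the same positions as permutationMatrix does for that input.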

Covariance calculation with CUDA

I am implementing Principal Component Analysis (PCA) based face recognition using CUDA. I used orl face database and calculated the mean image and normalized images. I'm facing a problem in calculating the covariance matrix.
__global__ void mean(int* i_data, int num, int size, int* o_data, int WIDTH, int HEIGHT, int* normalized)
{
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
int idx = x + y * WIDTH;
int r = 0;
int idx_z=0;
for (int z = 0; z < num; ++z)
{
idx_z = z * WIDTH*HEIGHT + idx;
r += i_data[ idx_z ];
}
o_data[ idx ] = int(r/num);
for (int z = 0; z < num; ++z)
{
idx_z = z * WIDTH*HEIGHT + idx;
normalized[idx_z] = abs(i_data[idx_z] - o_data[idx]);
}
}
dim3 dimBlock = dim3(8,4,1);
dim3 dimGrid = dim3(ceil(rows/dimBlock.x) , ceil(cols/dimBlock.y));
mean<<<dimGrid,dimBlock>>>(dev_images, IMAGE_NUM,size,dev_output,rows,cols,dev_normalized);
The database images are of size (92,112).
Your code does not make any sense to me.
Covariance calculation in CUDA can be easily performed by using cuBLAS in conjunction with Thrust. Considering N realizations of K random variables, the covariance estimation formula is the following
q_{jk} = \frac{1}{N-1} \sum_{i=1}^{N} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)
where q_{jk}, j,k=1,...,K are the covariance estimate values, X_{ij} is the i-th realization of the j-th random variable, and \bar{X}_j and \bar{X}_k are the random variable means as estimated from the available realizations.
Below, I'm reporting a fully worked example:
#include <cublas_v2.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/random.h>
#include <thrust/sequence.h>
#include <stdio.h>
#include <iostream>
#include "Utilities.cuh"
#include "TimingGPU.cuh"
/*************************************/
/* CONVERT LINEAR INDEX TO ROW INDEX */
/*************************************/
template <typename T>
struct linear_index_to_row_index : public thrust::unary_function<T,T> {
T Ncols; // --- Number of columns
__host__ __device__ linear_index_to_row_index(T Ncols) : Ncols(Ncols) {}
__host__ __device__ T operator()(T i) { return i / Ncols; }
};
/********/
/* MAIN */
/********/
int main()
{
const int Nsamples = 3; // --- Number of realizations for each random variable (number of rows of the X matrix)
const int NX = 4; // --- Number of random variables (number of columns of the X matrix)
// --- Random uniform integer distribution between 10 and 99
thrust::default_random_engine rng;
thrust::uniform_int_distribution<int> dist(10, 99);
// --- Matrix allocation and initialization
thrust::device_vector<float> d_X(Nsamples * NX);
for (size_t i = 0; i < d_X.size(); i++) d_X[i] = (float)dist(rng);
// --- cuBLAS handle creation
cublasHandle_t handle;
cublasSafeCall(cublasCreate(&handle));
/*************************************************/
/* CALCULATING THE MEANS OF THE RANDOM VARIABLES */
/*************************************************/
// --- Array containing the means (alpha below already carries the 1/Nsamples factor)
thrust::device_vector<float> d_means(NX);
thrust::device_vector<float> d_ones(Nsamples, 1.f);
float alpha = 1.f / (float)Nsamples;
float beta = 0.f;
cublasSafeCall(cublasSgemv(handle, CUBLAS_OP_T, Nsamples, NX, &alpha, thrust::raw_pointer_cast(d_X.data()), Nsamples,
thrust::raw_pointer_cast(d_ones.data()), 1, &beta, thrust::raw_pointer_cast(d_means.data()), 1));
/**********************************************/
/* SUBTRACTING THE MEANS FROM THE MATRIX ROWS */
/**********************************************/
thrust::transform(
d_X.begin(), d_X.end(),
thrust::make_permutation_iterator(
d_means.begin(),
thrust::make_transform_iterator(thrust::make_counting_iterator(0), linear_index_to_row_index<int>(Nsamples))),
d_X.begin(),
thrust::minus<float>());
/*************************************/
/* CALCULATING THE COVARIANCE MATRIX */
/*************************************/
thrust::device_vector<float> d_cov(NX * NX);
alpha = 1.f;
cublasSafeCall(cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, NX, NX, Nsamples, &alpha,
thrust::raw_pointer_cast(d_X.data()), Nsamples, thrust::raw_pointer_cast(d_X.data()), Nsamples, &beta,
thrust::raw_pointer_cast(d_cov.data()), NX));
// --- Final normalization by Nsamples - 1
thrust::transform(
d_cov.begin(), d_cov.end(),
thrust::make_constant_iterator((float)(Nsamples-1)),
d_cov.begin(),
thrust::divides<float>());
for(int i = 0; i < NX * NX; i++) std::cout << d_cov[i] << "\n";
return 0;
}
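To validate the GPU output on small inputs, the same estimator can be written as plain loops on the host. This is a sketch under the same layout assumption as the code above (column-major Nsamples x NX matrix, one column per random variable); sample_covariance is a hypothetical name and double precision is used for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unbiased sample covariance of a column-major nSamples x nVars matrix.
// Returns a column-major nVars x nVars matrix.
std::vector<double> sample_covariance(const std::vector<double>& X,
                                      int nSamples, int nVars)
{
    assert(nSamples > 1);
    assert(X.size() == static_cast<std::size_t>(nSamples) * nVars);

    // Mean of each column (one column per random variable).
    std::vector<double> mean(nVars, 0.0);
    for (int j = 0; j < nVars; ++j) {
        for (int i = 0; i < nSamples; ++i) mean[j] += X[i + j * nSamples];
        mean[j] /= nSamples;
    }

    // q_jk = sum_i (X_ij - mean_j) * (X_ik - mean_k) / (N - 1)
    std::vector<double> cov(static_cast<std::size_t>(nVars) * nVars, 0.0);
    for (int k = 0; k < nVars; ++k)
        for (int j = 0; j < nVars; ++j) {
            double acc = 0.0;
            for (int i = 0; i < nSamples; ++i)
                acc += (X[i + j * nSamples] - mean[j]) *
                       (X[i + k * nSamples] - mean[k]);
            cov[j + k * nVars] = acc / (nSamples - 1);
        }
    return cov;
}
```

For three samples of two variables with columns (1, 2, 3) and (2, 4, 6), the result is the matrix with entries 1, 2, 2, 4.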
I implemented a covariance calculator with cuBLAS and CUDA Thrust and compared its output with online covariance calculation tools. Mine seems to produce good results. The code below is intended for QDA Bayes, so the given matrix may contain more than one class and multiple covariance matrices are calculated. I hope it will be useful for someone.
//! Calculates one or more than one coVarianceMatrix given data.
// There can be many classes since many covariance matrixes.
/*!
\param inMatrix This vector contains matrix data in column-major storage.
For example if inMatrix=[1 2 3 4 5 6] and trialSizes=[2] this means matrix we will work on a matrix like :
|1 4 |
|2 5 |
|3 6 | -> 2 Trials, 3 Features. Columns contains feature rows contains trials (samples)
\param trialSizes There can be many classes since many covariance matrixes. Samples from all classes will be given with inMatrix.
But we need to know how many trials(samples) we have for each class.
For example if inMatrix=[1 2 3 4 5 6 7 8 9 10 11 12] and trialSizes=[2,2]
this means matrix we will work on a matrix like :
|1 4 | |7 10 |
|2 5 | |8 11 |
|3 6 | |9 12 | --> Total number of trials(samples which is total rowCount) 2 + 2 = 4 ,
So colSize = inMatrix.size()/4 = 3(feature vector size)
--> There are two elements in the trialSizes vector, so there are two classes with two samples each
*/
void multiQDACovianceCalculator(std::vector<float>& inMatrix, std::vector<int>& trialSizes)
{
cublasHandle_t handle; // CUBLAS context
int classCount = trialSizes.size();
int rowSize = std::accumulate(trialSizes.begin(), trialSizes.end(), 0);
int dimensionSize = inMatrix.size() / rowSize;
float alpha = 1.0f;
float beta = 0.0f;
thrust::device_vector<float> d_cov1(dimensionSize * dimensionSize);
thrust::device_vector<float> d_cov2(dimensionSize * dimensionSize);
thrust::device_vector<float> d_covResult(dimensionSize * dimensionSize);
thrust::device_vector<float> d_wholeMatrix(inMatrix);
thrust::device_vector<float> d_meansVec(dimensionSize); // rowVec of means of trials
float *meanVecPtr = thrust::raw_pointer_cast(d_meansVec.data());
float *device2DMatrixPtr = thrust::raw_pointer_cast(d_wholeMatrix.data());
auto maxTrialNumber = *std::max_element(trialSizes.begin(), trialSizes.end());
thrust::device_vector<float> deviceVector(maxTrialNumber, 1.0f);
cublasCreate(&handle);
// Inside of for loop one covariance matrix calculated each time
for (int i = 0; i < trialSizes.size(); i++)
{
// X*transpose(X) / N
alpha = 1.0f / trialSizes[i];
beta = 0.0f; // reset: beta is set to -1 for cublasSgeam at the end of the previous iteration
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, dimensionSize, dimensionSize, trialSizes[i], &alpha,
device2DMatrixPtr, dimensionSize, device2DMatrixPtr, dimensionSize, &beta,
thrust::raw_pointer_cast(d_cov1.data()), dimensionSize);
// Mean vector of each column
alpha = 1.0f;
cublasSgemv(handle, CUBLAS_OP_N, dimensionSize, trialSizes[i], &alpha, device2DMatrixPtr,
dimensionSize, thrust::raw_pointer_cast(deviceVector.data()), 1, &beta, meanVecPtr, 1);
// MeanVec * transpose(MeanVec) / N*N
alpha = 1.0f / (trialSizes[i] * trialSizes[i]);
cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, dimensionSize, dimensionSize, 1, &alpha,
meanVecPtr, 1, meanVecPtr, 1, &beta,
thrust::raw_pointer_cast(d_cov2.data()), dimensionSize);
alpha = 1.0f;
beta = -1.0f;
// (X*transpose(X) / N) - (MeanVec * transpose(MeanVec) / N*N)
cublasSgeam(handle, CUBLAS_OP_N, CUBLAS_OP_N, dimensionSize, dimensionSize, &alpha,
thrust::raw_pointer_cast(d_cov1.data()), dimensionSize, &beta, thrust::raw_pointer_cast(d_cov2.data()),
dimensionSize, thrust::raw_pointer_cast(d_covResult.data()), dimensionSize);
// Go to other class and calculate its covarianceMatrix
device2DMatrixPtr += trialSizes[i] * dimensionSize;
}
printVector(d_covResult);
cublasDestroy(handle);
}
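The loop above relies on the identity cov = E[x x^T] - mean * mean^T. A host-side sketch of that identity follows, assuming the same layout as the function above (column-major, one trial per column); covariance_by_moments is a hypothetical name. Note that this form divides by N, so it yields the population covariance, which differs from an estimator that divides by N - 1 by the factor (N - 1) / N.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Population covariance (divides by N) via E[x x^T] - mean * mean^T.
// X is column-major dim x nTrials: each column is one trial.
std::vector<double> covariance_by_moments(const std::vector<double>& X,
                                          int dim, int nTrials)
{
    assert(nTrials > 0);
    assert(X.size() == static_cast<std::size_t>(dim) * nTrials);

    std::vector<double> sum(dim, 0.0);                            // sum of x
    std::vector<double> cov(static_cast<std::size_t>(dim) * dim, 0.0);
    for (int t = 0; t < nTrials; ++t)
        for (int c = 0; c < dim; ++c) {
            sum[c] += X[c + t * dim];
            for (int r = 0; r < dim; ++r)                         // sum of x x^T
                cov[r + c * dim] += X[r + t * dim] * X[c + t * dim];
        }

    const double n = static_cast<double>(nTrials);
    for (int c = 0; c < dim; ++c)
        for (int r = 0; r < dim; ++r)
            cov[r + c * dim] = cov[r + c * dim] / n - sum[r] * sum[c] / (n * n);
    return cov;
}
```

For three trials (1, 2), (2, 4), (3, 6) the entries are 2/3, 4/3, 4/3 and 8/3, i.e. 2/3 of the N - 1 normalized values 1, 2, 2, 4.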

Singular values calculation only with CUDA

I'm trying to use the new cusolverDnSgesvd routine of CUDA 7.0 for the calculation of the singular values. The full code is reported below:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include<iostream>
#include<stdlib.h>
#include<stdio.h>
#include <cusolverDn.h>
#include <cuda_runtime_api.h>
/***********************/
/* CUDA ERROR CHECKING */
/***********************/
void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) { exit(code); }
}
}
void gpuErrchk(cudaError_t ans) { gpuAssert((ans), __FILE__, __LINE__); }
/********/
/* MAIN */
/********/
int main(){
int M = 10;
int N = 10;
// --- Setting the host matrix
float *h_A = (float *)malloc(M * N * sizeof(float));
for(unsigned int i = 0; i < M; i++){
for(unsigned int j = 0; j < N; j++){
h_A[j*M + i] = (i + j) * (i + j);
}
}
// --- Setting the device matrix and moving the host matrix to the device
float *d_A; gpuErrchk(cudaMalloc(&d_A, M * N * sizeof(float)));
gpuErrchk(cudaMemcpy(d_A, h_A, M * N * sizeof(float), cudaMemcpyHostToDevice));
// --- host side SVD results space
float *h_U = (float *)malloc(M * M * sizeof(float));
float *h_V = (float *)malloc(N * N * sizeof(float));
float *h_S = (float *)malloc(N * sizeof(float));
// --- device side SVD workspace and matrices
int work_size = 0;
int *devInfo; gpuErrchk(cudaMalloc(&devInfo, sizeof(int)));
float *d_U; gpuErrchk(cudaMalloc(&d_U, M * M * sizeof(float)));
float *d_V; gpuErrchk(cudaMalloc(&d_V, N * N * sizeof(float)));
float *d_S; gpuErrchk(cudaMalloc(&d_S, N * sizeof(float)));
cusolverStatus_t stat;
// --- CUDA solver initialization
cusolverDnHandle_t solver_handle;
cusolverDnCreate(&solver_handle);
stat = cusolverDnSgesvd_bufferSize(solver_handle, M, N, &work_size);
if(stat != CUSOLVER_STATUS_SUCCESS ) std::cout << "Initialization of cuSolver failed. \n";
float *work; gpuErrchk(cudaMalloc(&work, work_size * sizeof(float)));
//float *rwork; gpuErrchk(cudaMalloc(&rwork, work_size * sizeof(float)));
// --- CUDA SVD execution
//stat = cusolverDnSgesvd(solver_handle, 'A', 'A', M, N, d_A, M, d_S, d_U, M, d_V, N, work, work_size, NULL, devInfo);
stat = cusolverDnSgesvd(solver_handle, 'N', 'N', M, N, d_A, M, d_S, d_U, M, d_V, N, work, work_size, NULL, devInfo);
cudaDeviceSynchronize();
int devInfo_h = 0;
gpuErrchk(cudaMemcpy(&devInfo_h, devInfo, sizeof(int), cudaMemcpyDeviceToHost));
std::cout << "devInfo = " << devInfo_h << "\n";
switch(stat){
case CUSOLVER_STATUS_SUCCESS: std::cout << "SVD computation success\n"; break;
case CUSOLVER_STATUS_NOT_INITIALIZED: std::cout << "Library cuSolver not initialized correctly\n"; break;
case CUSOLVER_STATUS_INVALID_VALUE: std::cout << "Invalid parameters passed\n"; break;
case CUSOLVER_STATUS_INTERNAL_ERROR: std::cout << "Internal operation failed\n"; break;
}
if (devInfo_h == 0 && stat == CUSOLVER_STATUS_SUCCESS) std::cout << "SVD successful\n\n";
// --- Moving the results from device to host
gpuErrchk(cudaMemcpy(h_S, d_S, N * sizeof(float), cudaMemcpyDeviceToHost));
for(int i = 0; i < N; i++) std::cout << "d_S["<<i<<"] = " << h_S[i] << std::endl;
cusolverDnDestroy(solver_handle);
return 0;
}
If I ask for the computation of the full SVD (commented line with jobu = 'A' and jobvt = 'A') everything works fine. If I ask for the computation of the singular values only (line with jobu = 'N' and jobvt = 'N'), cusolverDnSgesvd returns
CUSOLVER_STATUS_INVALID_VALUE
Please note that, in this case devInfo = 0, so I cannot spot the invalid parameter.
Please also note that the documentation PDF lacks information about the rwork parameter so that I have dealt with it as a dummy parameter.
At this time, the cuSolver gesvd function only supports jobu = 'A' and jobvt = 'A'.
So the error when you specify other combinations is expected. From the documentation:
Remark 2: gesvd only supports jobu='A' and jobvt='A' and returns matrix U and VH
USE OF cusolverDnSgesvd
As remarked by lebedov, as of CUDA 8.0, it is now possible to calculate the singular values only by cusolverDnSgesvd. I report below a slightly modified version of your code with two calls to cusolverDnSgesvd, one performing the singular values calculation only
cusolverDnSgesvd(solver_handle, 'N', 'N', M, N, d_A, M, d_S, NULL, M, NULL, N, work, work_size, NULL, devInfo)
and one performing the full SVD calculation
cusolverDnSgesvd(solver_handle, 'A', 'A', M, N, d_A, M, d_S, d_U, M, d_V, N, work, work_size, NULL, devInfo)
As you already remarked, the two 'A' fields for the full SVD case are changed to 'N' in the singular values only case. Please note that, in the singular values only case, there is no need to allocate space for the singular vector matrices U and V, so NULL pointers are passed instead.
The singular values calculation only is faster than the full SVD calculation. On a GTX 960, for a 1000x1000 matrix, the timing has been the following:
Singular values only: 559 ms
Full SVD: 2239 ms
Here is the full code:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include<iostream>
#include<stdlib.h>
#include<stdio.h>
#include <cusolverDn.h>
#include <cuda_runtime_api.h>
#include "Utilities.cuh"
#include "TimingGPU.cuh"
/********/
/* MAIN */
/********/
int main(){
int M = 1000;
int N = 1000;
TimingGPU timerGPU;
float elapsedTime;
// --- Setting the host matrix
float *h_A = (float *)malloc(M * N * sizeof(float));
for (unsigned int i = 0; i < M; i++){
for (unsigned int j = 0; j < N; j++){
h_A[j*M + i] = (i + j) * (i + j);
}
}
// --- Setting the device matrix and moving the host matrix to the device
float *d_A; gpuErrchk(cudaMalloc(&d_A, M * N * sizeof(float)));
gpuErrchk(cudaMemcpy(d_A, h_A, M * N * sizeof(float), cudaMemcpyHostToDevice));
// --- host side SVD results space
float *h_U = (float *)malloc(M * M * sizeof(float));
float *h_V = (float *)malloc(N * N * sizeof(float));
float *h_S = (float *)malloc(N * sizeof(float));
// --- device side SVD workspace and matrices
int work_size = 0;
int *devInfo; gpuErrchk(cudaMalloc(&devInfo, sizeof(int)));
float *d_U; gpuErrchk(cudaMalloc(&d_U, M * M * sizeof(float)));
float *d_V; gpuErrchk(cudaMalloc(&d_V, N * N * sizeof(float)));
float *d_S; gpuErrchk(cudaMalloc(&d_S, N * sizeof(float)));
cusolverStatus_t stat;
// --- CUDA solver initialization
cusolverDnHandle_t solver_handle;
cusolveSafeCall(cusolverDnCreate(&solver_handle));
cusolveSafeCall(cusolverDnSgesvd_bufferSize(solver_handle, M, N, &work_size));
float *work; gpuErrchk(cudaMalloc(&work, work_size * sizeof(float)));
// --- CUDA SVD execution - Singular values only
timerGPU.StartCounter();
cusolveSafeCall(cusolverDnSgesvd(solver_handle, 'N', 'N', M, N, d_A, M, d_S, NULL, M, NULL, N, work, work_size, NULL, devInfo));
elapsedTime = timerGPU.GetCounter();
int devInfo_h = 0;
gpuErrchk(cudaMemcpy(&devInfo_h, devInfo, sizeof(int), cudaMemcpyDeviceToHost));
if (devInfo_h == 0)
printf("SVD successful for the singular values calculation only\n\n");
else if (devInfo_h < 0)
printf("SVD unsuccessful for the singular values calculation only. Parameter %i is wrong\n", -devInfo_h);
else
printf("SVD unsuccessful for the singular values calculation only. A number of %i superdiagonals of an intermediate bidiagonal form did not converge to zero\n", devInfo_h);
printf("Calculation of the singular values only: %f ms\n\n", elapsedTime);
// --- Moving the results from device to host
//gpuErrchk(cudaMemcpy(h_S, d_S, N * sizeof(float), cudaMemcpyDeviceToHost));
//for (int i = 0; i < N; i++) std::cout << "d_S[" << i << "] = " << h_S[i] << std::endl;
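// --- gesvd destroys the contents of d_A, so restore the input before the second run
gpuErrchk(cudaMemcpy(d_A, h_A, M * N * sizeof(float), cudaMemcpyHostToDevice));
// --- CUDA SVD execution - Full SVD
timerGPU.StartCounter();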
cusolveSafeCall(cusolverDnSgesvd(solver_handle, 'A', 'A', M, N, d_A, M, d_S, d_U, M, d_V, N, work, work_size, NULL, devInfo));
elapsedTime = timerGPU.GetCounter();
devInfo_h = 0;
gpuErrchk(cudaMemcpy(&devInfo_h, devInfo, sizeof(int), cudaMemcpyDeviceToHost));
if (devInfo_h == 0)
printf("SVD successful for the full SVD calculation\n\n");
else if (devInfo_h < 0)
printf("SVD unsuccessful for the full SVD calculation. Parameter %i is wrong\n", -devInfo_h);
else
printf("SVD unsuccessful for the full SVD calculation. A number of %i superdiagonals of an intermediate bidiagonal form did not converge to zero\n", devInfo_h);
printf("Calculation of the full SVD: %f ms\n\n", elapsedTime);
cusolveSafeCall(cusolverDnDestroy(solver_handle));
return 0;
}
EDIT - PERFORMANCE ACROSS DIFFERENT VERSIONS OF CUDA
I have compared the performance of the singular-values-only calculation and the full SVD calculation under CUDA 8.0, CUDA 9.1 and CUDA 10.0, for a 5000x5000 matrix. Here are the results on a GTX 960.
Computation type        CUDA 8.0    CUDA 9.1    CUDA 10.0
__________________________________________________________________
Singular values only    17s         15s         15s
Full SVD                161s        159s        457s
__________________________________________________________________

Matrix Multiplication giving wrong output [duplicate]

I am trying to multiply matrix A by matrix B and then, from the product matrix, get the index of the maximum value in each column. Unfortunately, only the first 128*128 values of the product are correct; the rest are garbage. I do not understand why this happens. Could you please point me in the right direction?
#include<stdio.h>
#include "cuda.h"
#include<stdlib.h>
#define blockD 32
const int wA = 128;
const int hA = 4096;
const int wB = 4096;
const int hB = wA;
main(void){
void MatrixMultiplication(float *, float *, float *, float *);
int size_A = wA * hA * sizeof(float);
int size_B = wB * hB * sizeof(float);
int size_C = wB * hA * sizeof(float);
int size_max = 2 * wB * sizeof(float);
float *M, *N, *P, *C;
// allocate memory on the CPU
M = (float*)malloc(size_A);
N = (float*)malloc(size_B);
P = (float*)malloc(size_max);
C = (float*)malloc(size_C);
// initialize the matrices
for (int y=0; y < hA; y++) {
for (int x=0; x < wA; x++){
M[y*wA + x] = 32; //x + y*wA;
}
}
for (int y=0; y<hB; y++) {
for (int x=0; x<wB; x++){
N[y*wB + x] = 21; //x + y*wB;
}
}
MatrixMultiplication(M, N, P, C);
//Write
FILE *f1;
int i,j;
f1 = fopen("C.txt","w");
for(i = hA - 2 ; i < hA; i ++){
for(j = 0; j < wB; j++){
fprintf(f1,"%d\t",int(C[i*wB + j]));
}
fprintf(f1,"\n");
}
fclose(f1);
// free the memory allocated on the CPU
free( M );
free( N );
free( P );
free( C );
cudaDeviceReset();
return 0;
}
__device__ void MaxFunction(float* Pd, float* max)
{
int x = (threadIdx.x + blockIdx.x * blockDim.x);
int y = (threadIdx.y + blockIdx.y * blockDim.y);
int k = 0;
int temp = 0; int temp_idx = 0;
for (k = 0; k < wB; ++k) {
if(Pd[x*wB + k] > temp){
temp = Pd[x*wB + k];
temp_idx = x*wB + k;
}
}
max[y*2 + 0] = temp;
max[y*2 + 1] = temp_idx;
}
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, float* max)
{
// declare cache in the shared memory
__shared__ float Mds[blockD][blockD];
__shared__ float Nds[blockD][blockD];
float Pvalue = 0;
// Loop over the Md and Nd block dimension required to compute the Pd element
for (int m = (wA * blockD * blockIdx.y), n = (blockD * blockIdx.x);
m < ((wA * blockD * blockIdx.y)+wA-1);
m += blockD, n += (blockD*hB)){
// collaboratively loading of Md and Nd blocks into shared memory
Mds[threadIdx.y][threadIdx.x] = Md[m + wA * threadIdx.y + threadIdx.x];
Nds[threadIdx.y][threadIdx.x] = Nd[n + wA * threadIdx.y + threadIdx.x];
__syncthreads();
// keep track of the running sum
for (int k = 0; k < blockD; k++)
Pvalue += Mds[threadIdx.y][k] * Nds[k][threadIdx.x];
__syncthreads();
}
// write back to the global memory
int p = hB * blockD * blockIdx.y + blockD * blockIdx.x;
Pd[p + hB * threadIdx.y + threadIdx.x] = Pvalue;
__syncthreads();
MaxFunction(Pd, max);
}
void MatrixMultiplication(float *M, float *N, float *P, float *C) {
int size_A = wA * hA * sizeof(float);
int size_B = wB * hB * sizeof(float);
int size_C = wB * hA * sizeof(float);
int size_max = 2 * wB * sizeof(float);
float *Md, *Nd, *Pd, *max;
// allocate memory on the GPU
cudaMalloc((void**)&Md, size_A);
cudaMalloc((void**)&Nd, size_B);
cudaMalloc((void**)&Pd, size_C);
cudaMalloc((void**)&max, size_max);
// transfer M and N to device memory
cudaMemcpy(Md, M, size_A, cudaMemcpyHostToDevice);
cudaMemcpy(Nd, N, size_B, cudaMemcpyHostToDevice);
// kernel invocation code
dim3 dimBlock(blockD, blockD);
dim3 dimGrid(wA/blockD, hB/blockD);
//Execute Kernel
MatrixMulKernel<<<dimGrid, dimBlock>>>( Md, Nd, Pd, max);
// transfer P from device
cudaMemcpy(P, max, size_max, cudaMemcpyDeviceToHost);
cudaMemcpy(C, Pd, size_C, cudaMemcpyDeviceToHost);
// free the memory allocated on the GPU
cudaFree(Md);
cudaFree(Nd);
cudaFree(Pd);
cudaFree(max);
}
Your code seems to have more than one problem. The first is the grid size. In place of this:
dim3 dimGrid(wA/blockD, hB/blockD);
You should have this:
dim3 dimGrid(wB/blockD, hA/blockD);
Ultimately you need one thread in your grid for each output point. Your formulation was giving you a grid of 4 blocks by 4 blocks, whereas you need a grid of 128 blocks by 128 blocks.
The other problem I found with your code was in these lines in the kernel:
int p = hB * blockD * blockIdx.y + blockD * blockIdx.x;
Pd[p + hB * threadIdx.y + threadIdx.x] = Pvalue;
They are not indexing properly through the output array. Rather than try to sort it out using your scheme, I used this instead:
Pd[(threadIdx.x + (blockIdx.x * blockDim.x)) + ((threadIdx.y + (blockIdx.y * blockDim.y))*(gridDim.x*blockDim.x))] = Pvalue;
When I made the above two changes to your code, I got what I believe are correct results throughout the array. And it took about 32 seconds on my machine to run it. (Note that I haven't tried fixing your original max-finding code -- see below for a better approach.)
Based on your previous question, you seem to be concerned about speed. If you want a fast matrix multiply, you should use CUBLAS. The following code shows how to use CUBLAS to multiply two ordinary C-style (row-major) matrices; they don't have to be square. I've also included a column-max finding kernel that will be fast when the number of columns is large (say, over 500 or so; you have 4096 columns in your example). For small numbers of columns there may be quicker ways to perform this function, but a small number of columns also suggests that the overall problem size is small, so the speed of this piece of code will not really be an issue.
Here's the code:
#include <stdio.h>
#include <cublas_v2.h>
#define VERBOSE 1
#define nTPB 64
#define ROW_A 4
#define COL_A 4
#define ROW_B COL_A
#define COL_B 4
#define ROW_C ROW_A
#define COL_C COL_B
#define SIZ_A (ROW_A*COL_A)
#define SIZ_B (ROW_B*COL_B)
#define SIZ_C (ROW_C*COL_C)
// error check macros
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
// for CUBLAS V2 API
#define cublasCheckErrors(fn) \
do { \
cublasStatus_t __err = fn; \
if (__err != CUBLAS_STATUS_SUCCESS) { \
fprintf(stderr, "Fatal cublas error: %d (at %s:%d)\n", \
(int)(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
__global__ void col_max(float *mat, float *max, unsigned int *midx, unsigned int rows, unsigned int cols){
int idx = threadIdx.x + blockDim.x*blockIdx.x;
if (idx < cols){
float tempmax = mat[idx];
unsigned int tempmidx = 0;
for (int i = 1; i< rows; i++)
if (mat[idx + (i*cols)] > tempmax){
tempmax = mat[idx + (i*cols)];
tempmidx = i;}
max[idx] = tempmax;
midx[idx] = tempmidx;
}
}
int main(){
float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C, *h_max, *d_max;
unsigned int *h_idx, *d_idx;
h_A = (float *)malloc(SIZ_A*sizeof(float));
if (h_A==0) {printf("malloc fail\n"); return -1;}
h_B = (float *)malloc(SIZ_B*sizeof(float));
if (h_B==0) {printf("malloc fail\n"); return -1;}
h_C = (float *)malloc(SIZ_C*sizeof(float));
if (h_C==0) {printf("malloc fail\n"); return -1;}
h_max = (float *)malloc(COL_C*sizeof(float));
if (h_max==0) {printf("malloc fail\n"); return -1;}
h_idx = (unsigned int*)malloc(COL_C*sizeof(unsigned int));
if (h_idx==0) {printf("malloc fail\n"); return -1;}
cudaMalloc((void **)&d_A, SIZ_A*sizeof(float));
cudaMalloc((void **)&d_B, SIZ_B*sizeof(float));
cudaMalloc((void **)&d_C, SIZ_C*sizeof(float));
cudaMalloc((void **)&d_max, COL_C*sizeof(float));
cudaMalloc((void **)&d_idx, COL_C*sizeof(unsigned int));
cudaCheckErrors("cuda malloc fail");
// initialize data
for (int i=0; i< SIZ_A; i++) h_A[i] = (float)(i+1);
for (int i=0; i< SIZ_B; i++) h_B[i] = (float)(i+2);
cudaMemcpy(d_A, h_A, SIZ_A*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, SIZ_B*sizeof(float), cudaMemcpyHostToDevice);
cudaCheckErrors("cuda memcpy 1 fail");
const float alpha = 1.0f;
const float beta = 0.0f;
cublasHandle_t handle;
cublasCheckErrors(cublasCreate(&handle));
// C = A*B
// due to cublas expecting column-major storage, parameters
// are scrambled
cublasCheckErrors(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, COL_B, ROW_A, COL_A, &alpha, d_B, COL_B, d_A, COL_A, &beta, d_C, COL_C));
cudaMemcpy(h_C, d_C, SIZ_C*sizeof(float), cudaMemcpyDeviceToHost);
cudaCheckErrors("cuda memcpy 2 fail");
col_max<<<(COL_C + nTPB - 1)/nTPB, nTPB>>>(d_C, d_max, d_idx, ROW_C, COL_C);
cudaCheckErrors("kernel launch fail");
cudaMemcpy(h_max, d_max, COL_C*sizeof(float), cudaMemcpyDeviceToHost);
cudaMemcpy(h_idx, d_idx, COL_C*sizeof(unsigned int), cudaMemcpyDeviceToHost);
cudaCheckErrors("cuda memcpy 3 fail/kernel fail");
if (VERBOSE){
printf("A: \n");
for (int i=0; i< ROW_A; i++){
for (int j=0; j< COL_A; j++)
printf("%7.5G", h_A[j+(i*COL_A)]);
printf("\n");}
printf("B: \n");
for (int i=0; i< ROW_B; i++){
for (int j=0; j< COL_B; j++)
printf("%7.5G", h_B[j+(i*COL_B)]);
printf("\n");}
printf("C = A*B: \n");
for (int i=0; i< ROW_C; i++){
for (int j=0; j< COL_C; j++)
printf("%7.5G", h_C[j+(i*COL_C)]);
printf("\n");}
printf("COLUMN MAX:\n");
for (int i=0; i< COL_C; i++)
printf("%7.5G", h_max[i]);
printf("\nCOLUMN MAX IDX:\n");
for (int i=0; i< COL_C; i++)
printf("%7d", h_idx[i]);
}
printf("\n finished!\n");
return 0;
}
Here's what I used to compile:
$ nvcc -arch=sm_20 -O3 -o t221 t221.cu -lcublas
And here's the sample output:
$ cuda-memcheck ./t221
========= CUDA-MEMCHECK
A:
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
B:
2 3 4 5
6 7 8 9
10 11 12 13
14 15 16 17
C = A*B:
100 110 120 130
228 254 280 306
356 398 440 482
484 542 600 658
COLUMN MAX:
484 542 600 658
COLUMN MAX IDX:
3 3 3 3
finished!
========= ERROR SUMMARY: 0 errors
$
When I extended my code to handle the sizes you indicated (A = 4096x128, B = 128x4096), it took about 1 second on my machine, so it's much faster than your code. However, when I take your code and comment out the call to MaxFunction in the kernel, it also takes only about 1 second to compute the matrix multiply result. So if you want to keep your own matrix multiply code (i.e. not use CUBLAS), you could split the work into two kernels: your multiply routine in the first and my max-finding routine (col_max) in the second. That should also give a pretty fast result.
As @talonmies indicated, if you are running on a Windows machine, be sure you are aware of the ramifications of Windows TDR (search for it in the search box in the upper right corner if needed).