Cuda Exceptions - cuda

I am doing something in CUDA (FFT), but I have no idea why it is generating exceptions when calling the kernel function.
All includes and definitions:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <time.h>
#define CPU_ARRAY_SIZE 1024 // 1024, 2048, 4096 8192
#define GPU_ARRAY_SIZE 512 //
#define THREAD_SIZE 16 // fixed
#define BLOCK_SIZE (GPU_ARRAY_SIZE/THREAD_SIZE) // 32
#define PI 3.14
I am running this on an NVIDIA GTX 480, so I thought the problem might be shared memory capacity (there are quite a few shared arrays), although that doesn't seem to be it. I changed GPU_ARRAY_SIZE to see how it behaves, and I get different results for 32, 64, 256, and 512. In the 512 case it returns all zeros, which I guess means CUDA couldn't compute anything; in the other cases the output is strange, and I don't know why it skips 16 cells at a time without computing anything. In most cases, the Output window of Microsoft Visual Studio shows a flood of exceptions of the form "First-chance exception at 0x75b9b9bc in .exe: Microsoft C++ exception: cudaError_enum at memory location ". Before you ask me to debug it: I can't, because VS won't debug files it doesn't recognize the way it does .cpp files (at least that is how it behaves in my case).
Do you have any idea about the following questions?
1. Why is it generating exceptions?
2. Why does it perform the calculation for only a few cells, when it should do so for every cell in every block?
How can I solve this problem? Any ideas?
Kernel function:
__global__ void twiddle_factor(double *d_isub_matrix, double *d_osub_matrix)
{
__shared__ double block[THREAD_SIZE][THREAD_SIZE];
__shared__ double spectrum[THREAD_SIZE][THREAD_SIZE];
__shared__ double sum_cos[THREAD_SIZE][THREAD_SIZE]; // declaring the shared sum_cos.. similarly for sum_sin
__shared__ double sum_sin[THREAD_SIZE][THREAD_SIZE];
__shared__ double local_cos[THREAD_SIZE][THREAD_SIZE]; // declaring the shared sum_cos.. similarly for sum_sin
__shared__ double local_sin[THREAD_SIZE][THREAD_SIZE];
unsigned int xIndex = threadIdx.x + blockIdx.x* blockDim.x;
unsigned int yIndex = threadIdx.y + blockIdx.y* blockDim.y;
int u;
int x=0,y=0;
int tx = threadIdx.x;
int ty = threadIdx.y;
double sum_sines=0.0,sum_cosines=0.0;
double angle=(2*PI)/GPU_ARRAY_SIZE;
block[tx][ty] = d_isub_matrix[yIndex*GPU_ARRAY_SIZE+xIndex];
__syncthreads();
//for every column!
for(u=0; u<THREAD_SIZE; u++)
{
/* All threads calculate its own sin and cos value. */
local_sin[tx][ty] = block[tx][ty] * sin((angle*ty)*u);
local_cos[tx][ty] = block[tx][ty] * cos((angle*ty)*u);
/* Only one row is activate. The thread in row adds all element of its column. */
if (ty == u)
{
sum_sines = 0.0;
sum_cosines = 0.0;
/* Access each column to add all elements of the column.*/
for (y=0; y<THREAD_SIZE; y++)
{
sum_sines += local_sin[tx][y];
sum_cosines += local_cos[tx][y];
}
//if (sum_sines < 0)
//sum_sin[u][tx] = ((-1)*sum_sines)/GPU_ARRAY_SIZE;
//else
sum_sin[u][tx] = sum_sines/GPU_ARRAY_SIZE;
//if (sum_cosines < 0)
//sum_cos[u][tx] = ((-1)*sum_cosines)/GPU_ARRAY_SIZE;
//else
sum_cos[u][tx] = sum_cosines/GPU_ARRAY_SIZE;
}
__syncthreads();
}
spectrum[tx][ty] = sqrt((double)pow(sum_sin[tx][ty],2)
+(double)pow(sum_cos[tx][ty],2));
__syncthreads();
block[tx][ty] = spectrum[tx][ty];
__syncthreads();
//for every row!
for(u=0; u<THREAD_SIZE; u++)
{
/* All threads calculate its own sin and cos value. */
local_sin[tx][ty] = block[tx][ty] * sin((angle*ty)*u);
local_cos[tx][ty] = block[tx][ty] * cos((angle*ty)*u);
/* Only one column is activate. The thread in colum adds all element of its row. */
if (tx == u)
{
sum_sines = 0.0;
sum_cosines = 0.0;
for (x=0; x<THREAD_SIZE; x++)
{
sum_sines += local_sin[x][ty];
sum_cosines += local_cos[x][ty];
}
//if (sum_sines < 0)
//sum_sin[ty][u] = ((-1)*sum_sines)/GPU_ARRAY_SIZE;
//else
sum_sin[ty][u] = sum_sines/GPU_ARRAY_SIZE;
//if (sum_cosines < 0)
//sum_cos[ty][u] = ((-1)*sum_cosines)/GPU_ARRAY_SIZE;
//else
sum_cos[ty][u] = sum_cosines/GPU_ARRAY_SIZE;
}
__syncthreads();
}
spectrum[tx][ty] = sqrt((double)pow(sum_sin[tx][ty],2)+(double)pow(sum_cos[tx][ty],2));
__syncthreads();
/* Transpose! I think this is not necessary part. */
d_osub_matrix[xIndex*GPU_ARRAY_SIZE + yIndex] = spectrum[threadIdx.y][threadIdx.x];
__syncthreads();
}
The main function:
int main(int argc, char** argv)
{
int i,j, w, h, sw, sh;
int numSubblock = CPU_ARRAY_SIZE / GPU_ARRAY_SIZE;
double *d_isub_matrix,*d_osub_matrix;
double *big_matrix = new double[CPU_ARRAY_SIZE*CPU_ARRAY_SIZE];
double *big_matrix2 = new double[CPU_ARRAY_SIZE*CPU_ARRAY_SIZE];
double *isub_matrix = new double[GPU_ARRAY_SIZE*GPU_ARRAY_SIZE];
double *osub_matrix = new double[GPU_ARRAY_SIZE*GPU_ARRAY_SIZE];
cudaEvent_t start,stop;
float elapsedtime;
cudaEventCreate(&start);
cudaEventCreate(&stop);
for (i=0; i<CPU_ARRAY_SIZE; i++)
{
for (j=0; j<CPU_ARRAY_SIZE; j++)
big_matrix[i*CPU_ARRAY_SIZE + j] = rand();//i*CPU_ARRAY_SIZE + j;
}
cudaEventRecord(start,0);
//cudaMalloc((void**)&d_isub_matrix,(GPU_ARRAY_SIZE*GPU_ARRAY_SIZE)*sizeof(float)*2);
//cudaMalloc((void**)&d_osub_matrix,(GPU_ARRAY_SIZE*GPU_ARRAY_SIZE)*sizeof(float)*2);
for(i = 0; i < numSubblock; i++)
{
for (j=0; j < numSubblock; j++)
{
// start position of subarea of big array
cudaMalloc((void**)&d_isub_matrix,(GPU_ARRAY_SIZE*GPU_ARRAY_SIZE)*sizeof(float));
cudaMalloc((void**)&d_osub_matrix,(GPU_ARRAY_SIZE*GPU_ARRAY_SIZE)*sizeof(float));
h = i*GPU_ARRAY_SIZE;
w = j*GPU_ARRAY_SIZE;
//printf("h = %d, w=%d",h,w);
//system("PAUSE");
// move subarea of big array into isub array.
for (sh = 0; sh < GPU_ARRAY_SIZE; sh++)
{
for (sw = 0; sw <GPU_ARRAY_SIZE; sw++)
{
isub_matrix[sh*GPU_ARRAY_SIZE+sw] = big_matrix[(h+sh)*CPU_ARRAY_SIZE + (w+sw)];
}
}
cudaMemcpy(d_isub_matrix,isub_matrix,((GPU_ARRAY_SIZE*GPU_ARRAY_SIZE)*sizeof(float)),cudaMemcpyHostToDevice);
//call the cuda kernel
dim3 blocks(BLOCK_SIZE, BLOCK_SIZE);
dim3 threads(THREAD_SIZE, THREAD_SIZE);
twiddle_factor<<<blocks, threads>>>(d_isub_matrix,d_osub_matrix);
cudaMemcpy(osub_matrix,d_osub_matrix,((GPU_ARRAY_SIZE*GPU_ARRAY_SIZE)*sizeof(float)),cudaMemcpyDeviceToHost);
for (sh = 0; sh < GPU_ARRAY_SIZE; sh++)
{
for (sw = 0; sw <GPU_ARRAY_SIZE; sw++)
{
big_matrix2[(h+sh)*CPU_ARRAY_SIZE + (w+sw)] = osub_matrix[sh*GPU_ARRAY_SIZE+sw];
printf(" sh %d sw %d %lf \n", sh, sw, osub_matrix[sh*GPU_ARRAY_SIZE+sw]);
}
}
printf("passed through here a few times\n");
cudaFree(d_osub_matrix);
cudaFree(d_isub_matrix);
}
}
// cudaFree(d_osub_matrix);
// cudaFree(d_isub_matrix);
//Stop the time
cudaEventRecord(stop,0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedtime,start,stop);
//showing the processing time
printf("The processing time took... %fms to execute everything",elapsedtime);
system("PAUSE");
for (sh = 0; sh < CPU_ARRAY_SIZE; sh++)
{
for (sw = 0; sw <CPU_ARRAY_SIZE; sw++)
{
printf(" sh %d sw %d %lf \n", sh, sw, big_matrix2[sh*CPU_ARRAY_SIZE+sw]);
}
}
system("PAUSE");
// I guess the result is "[1][0] = [1], [1][512] = [513], [513][0] = [524289], [513][512] = [524801]".
}

At a quick glance, the problem is most likely in the following lines:
// start position of subarea of big array
cudaMalloc((void**)&d_isub_matrix,(GPU_ARRAY_SIZE*GPU_ARRAY_SIZE)*sizeof(float));
cudaMalloc((void**)&d_osub_matrix,(GPU_ARRAY_SIZE*GPU_ARRAY_SIZE)*sizeof(float));
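You are allocating too little memory for your double values on the GPU. Your sub-matrix is allocated with 4 bytes per element where 8 bytes are needed. The cudaMemcpy calls use the same sizeof(float) factor, so they copy only half of the data as well.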
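To make the size mismatch concrete, here is a minimal host-only C++ sketch (it needs no GPU) comparing the number of bytes the question's cudaMalloc calls request with the number of bytes the kernel touches. The constant mirrors GPU_ARRAY_SIZE from the question; the function names are made up for illustration.

```cpp
#include <cstddef>

// Mirrors GPU_ARRAY_SIZE from the question.
constexpr std::size_t kGpuArraySize = 512;

// Bytes requested by the question's cudaMalloc calls (element size: float).
constexpr std::size_t bytes_allocated() {
    return kGpuArraySize * kGpuArraySize * sizeof(float);
}

// Bytes the kernel reads and writes, since it indexes the buffers as double*.
constexpr std::size_t bytes_needed() {
    return kGpuArraySize * kGpuArraySize * sizeof(double);
}

// Hypothetical helper: how far past the end of each buffer the kernel reaches.
constexpr std::size_t bytes_overrun() {
    return bytes_needed() - bytes_allocated();
}
```

With GPU_ARRAY_SIZE = 512 each buffer is 1 MiB while the kernel addresses 2 MiB, so the upper half of the index range is out of bounds. Using sizeof(double) in the cudaMalloc and cudaMemcpy calls removes the mismatch.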

Related

CUBLAS batch and matrix sizes [duplicate]

Some background info on the problem I am trying to speed up using CUDA:
I have a large number of small to moderate, same-sized linear systems that I need to solve independently. Each linear system is square, real, dense, invertible, and non-symmetric. These are actually matrix systems, so each one looks like AX = B, where A, X, and B are (n x n) matrices.
In a previous question, CUBLAS batch and matrix sizes, I learned that cuBLAS batch operations give the best performance for matrices of size 100x100 or smaller.
I still have an issue because the matrices I am working with have 100 < n < 700. These are moderately sized matrices for which cuBLAS batch operations do not give the best performance, and the non-batched cuSOLVER routines (cusolverDnDgetrf, cusolverDnDgetrs) do not outperform MATLAB either (see the timings below).
I did some timing comparisons against MATLAB for solving a single system, and found that the non-batched routines are faster only for matrices of size (4096x4096) or larger. I generated a random (n x n) matrix for n = 64, 256, 512, 1024, 4096, 16384, and timed only the factorization and the forward/back solve, with no transfers across PCIe.
DOUBLE PRECISION CUDA (GTX 1080 Ti) vs MATLAB (backslash)
n        GPU (sec)    MATLAB (sec)
64       0.001157     0.000205
256      0.01161      0.007762
512      0.026348     0.008550
1024     0.064357     0.036280
4096     0.734908     1.174442
16384    32.962229    68.691236
These timings lead me to conclude that iterating over my matrices one by one and calling the non-batched inversion method will be slower than MATLAB. Also, for my moderately sized matrices, the cuBLAS batched inversion method will not perform well, according to CUBLAS batch and matrix sizes.
Is there another approach I should consider to speed up my code with CUDA? Or am I misunderstanding something?
/* How to use
* ./cuSolverDn_LinearSolver // Default: cholesky
* ./cuSolverDn_LinearSolver -R=chol -file<file> // cholesky factorization
* ./cuSolverDn_LinearSolver -R=lu -file<file> // LU with partial pivoting
* ./cuSolverDn_LinearSolver -R=qr -file<file> // QR factorization
*
* Remark: the absolute error on solution x is meaningless without knowing condition number of A.
* The relative error on residual should be close to machine zero, i.e. 1.e-15.
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <assert.h>
#include <cuda_runtime.h>
#include "cublas_v2.h"
#include "cusolverDn.h"
#include "helper_cuda.h"
#include "helper_cusolver.h"
int linearSolverLU(
cusolverDnHandle_t handle,
int n,
const double *Acopy,
int lda,
const double *b,
double *x)
{
int bufferSize = 0;
int *info = NULL;
double *buffer = NULL;
double *A = NULL;
int *ipiv = NULL; // pivoting sequence
int h_info = 0;
double start, stop;
double time_solve;
checkCudaErrors(cusolverDnDgetrf_bufferSize(handle, n, n, (double*)Acopy, lda, &bufferSize));
checkCudaErrors(cudaMalloc(&info, sizeof(int)));
checkCudaErrors(cudaMalloc(&buffer, sizeof(double)*bufferSize));
checkCudaErrors(cudaMalloc(&A, sizeof(double)*lda*n));
checkCudaErrors(cudaMalloc(&ipiv, sizeof(int)*n));
// prepare a copy of A because getrf will overwrite A with L
checkCudaErrors(cudaMemcpy(A, Acopy, sizeof(double)*lda*n, cudaMemcpyDeviceToDevice));
checkCudaErrors(cudaMemset(info, 0, sizeof(int)));
start = second();
start = second();
checkCudaErrors(cusolverDnDgetrf(handle, n, n, A, lda, buffer, ipiv, info));
checkCudaErrors(cudaMemcpy(&h_info, info, sizeof(int), cudaMemcpyDeviceToHost));
if ( 0 != h_info ){
fprintf(stderr, "Error: LU factorization failed\n");
}
//checkCudaErrors(cudaMemcpy(x, b, sizeof(double)*n, cudaMemcpyDeviceToDevice));
checkCudaErrors(cudaMemcpy(x, b, sizeof(double)*lda*n, cudaMemcpyDeviceToDevice));
//checkCudaErrors(cusolverDnDgetrs(handle, CUBLAS_OP_N, n, 1, A, lda, ipiv, x, n, info));
checkCudaErrors(cusolverDnDgetrs(handle, CUBLAS_OP_N, n, n, A, lda, ipiv, x, n, info));
checkCudaErrors(cudaDeviceSynchronize());
stop = second();
time_solve = stop - start;
fprintf (stdout, "timing: LU = %10.6f sec\n", time_solve);
if (info ) { checkCudaErrors(cudaFree(info )); }
if (buffer) { checkCudaErrors(cudaFree(buffer)); }
if (A ) { checkCudaErrors(cudaFree(A)); }
if (ipiv ) { checkCudaErrors(cudaFree(ipiv));}
return 0;
}
void generate_random_dense_matrix(int M, int N, double **outA)
{
int i, j;
double rMax = (double)RAND_MAX;
double *A = (double *)malloc(sizeof(double) * M * N);
// For each column
for (j = 0; j < N; j++)
{
// For each row
for (i = 0; i < M; i++)
{
double dr = (double)rand();
A[j * M + i] = (dr / rMax) * 100.0;
//printf("A[j * M + i] = %f \n",A[j * M + i]);
}
}
*outA = A;
}
int main (int argc, char *argv[])
{
struct testOpts opts;
cusolverDnHandle_t handle = NULL;
cublasHandle_t cublasHandle = NULL; // used in residual evaluation
cudaStream_t stream = NULL;
int rowsA = 0; // number of rows of A
int colsA = 0; // number of columns of A
int nnzA = 0; // number of nonzeros of A
int baseA = 0; // base index in CSR format
int lda = 0; // leading dimension in dense matrix
// CSR(A) from I/O
int *h_csrRowPtrA = NULL;
int *h_csrColIndA = NULL;
double *h_csrValA = NULL;
double *h_A = NULL; // dense matrix from CSR(A)
double *h_x = NULL; // a copy of d_x
double *h_b = NULL; // b = ones(m,1)
double *h_r = NULL; // r = b - A*x, a copy of d_r
double *d_A = NULL; // a copy of h_A
double *d_x = NULL; // x = A \ b
double *d_b = NULL; // a copy of h_b
double *d_r = NULL; // r = b - A*x
// the constants are used in residual evaluation, r = b - A*x
const double minus_one = -1.0;
const double one = 1.0;
double x_inf = 0.0;
double r_inf = 0.0;
double A_inf = 0.0;
int errors = 0;
colsA = 660;
rowsA = colsA;
int NN = colsA;
int MM = rowsA;
lda = rowsA;
// Generate inputs
srand(9384);
generate_random_dense_matrix(MM, NN, &h_A);
generate_random_dense_matrix(MM, NN, &h_b);
parseCommandLineArguments(argc, argv, opts);
if (NULL == opts.testFunc)
{
//opts.testFunc = "chol"; // By default running Cholesky as NO solver selected with -R option.
opts.testFunc = "lu";
//opts.testFunc = "qr";
}
findCudaDevice(argc, (const char **)argv);
/*
printf("step 1: read matrix market format\n");
if (opts.sparse_mat_filename == NULL)
{
opts.sparse_mat_filename = sdkFindFilePath("gr_900_900_crg.mtx", argv[0]);
if (opts.sparse_mat_filename != NULL)
printf("Using default input file [%s]\n", opts.sparse_mat_filename);
else
printf("Could not find gr_900_900_crg.mtx\n");
}
else
{
printf("Using input file [%s]\n", opts.sparse_mat_filename);
}
if (opts.sparse_mat_filename == NULL)
{
fprintf(stderr, "Error: input matrix is not provided\n");
return EXIT_FAILURE;
}
if (loadMMSparseMatrix<double>(opts.sparse_mat_filename, 'd', true , &rowsA, &colsA,
&nnzA, &h_csrValA, &h_csrRowPtrA, &h_csrColIndA, true))
{
exit(EXIT_FAILURE);
}
baseA = h_csrRowPtrA[0]; // baseA = {0,1}
printf("sparse matrix A is %d x %d with %d nonzeros, base=%d\n", rowsA, colsA, nnzA, baseA);
if ( rowsA != colsA )
{
fprintf(stderr, "Error: only support square matrix\n");
exit(EXIT_FAILURE);
}
printf("step 2: convert CSR(A) to dense matrix\n");
lda = opts.lda ? opts.lda : rowsA;
if (lda < rowsA)
{
fprintf(stderr, "Error: lda must be greater or equal to dimension of A\n");
exit(EXIT_FAILURE);
}
*/
//h_A = (double*)malloc(sizeof(double)*lda*colsA);
h_x = (double*)malloc(sizeof(double)*lda*colsA);
//h_b = (double*)malloc(sizeof(double)*rowsA);
h_r = (double*)malloc(sizeof(double)*lda*rowsA);
assert(NULL != h_A);
assert(NULL != h_x);
assert(NULL != h_b);
assert(NULL != h_r);
/*
memset(h_A, 0, sizeof(double)*lda*colsA);
for(int row = 0 ; row < rowsA ; row++)
{
const int start = h_csrRowPtrA[row ] - baseA;
const int end = h_csrRowPtrA[row+1] - baseA;
for(int colidx = start ; colidx < end ; colidx++)
{
const int col = h_csrColIndA[colidx] - baseA;
const double Areg = h_csrValA[colidx];
h_A[row + col*lda] = Areg;
}
}
printf("step 3: set right hand side vector (b) to 1\n");
for(int row = 0 ; row < rowsA ; row++)
{
h_b[row] = 1.0;
}
*/
// verify if A is symmetric or not.
if ( 0 == strcmp(opts.testFunc, "chol") )
{
int issym = 1;
for(int j = 0 ; j < colsA ; j++)
{
for(int i = j ; i < rowsA ; i++)
{
double Aij = h_A[i + j*lda];
double Aji = h_A[j + i*lda];
if ( Aij != Aji )
{
issym = 0;
break;
}
}
}
if (!issym)
{
printf("Error: A has no symmetric pattern, please use LU or QR \n");
exit(EXIT_FAILURE);
}
}
checkCudaErrors(cusolverDnCreate(&handle));
checkCudaErrors(cublasCreate(&cublasHandle));
checkCudaErrors(cudaStreamCreate(&stream));
checkCudaErrors(cusolverDnSetStream(handle, stream));
checkCudaErrors(cublasSetStream(cublasHandle, stream));
checkCudaErrors(cudaMalloc((void **)&d_A, sizeof(double)*lda*colsA));
checkCudaErrors(cudaMalloc((void **)&d_x, sizeof(double)*lda*colsA));
checkCudaErrors(cudaMalloc((void **)&d_b, sizeof(double)*lda*rowsA));
checkCudaErrors(cudaMalloc((void **)&d_r, sizeof(double)*lda*rowsA));
printf("step 4: prepare data on device\n");
checkCudaErrors(cudaMemcpy(d_A, h_A, sizeof(double)*lda*colsA, cudaMemcpyHostToDevice));
checkCudaErrors(cudaMemcpy(d_b, h_b, sizeof(double)*lda*rowsA, cudaMemcpyHostToDevice));
printf("step 5: solve A*x = b \n");
// d_A and d_b are read-only
if ( 0 == strcmp(opts.testFunc, "chol") )
{
linearSolverCHOL(handle, rowsA, d_A, lda, d_b, d_x);
}
else if ( 0 == strcmp(opts.testFunc, "lu") )
{
//printf("hi \n");
linearSolverLU(handle, rowsA, d_A, lda, d_b, d_x);
}
else if ( 0 == strcmp(opts.testFunc, "qr") )
{
linearSolverQR(handle, rowsA, d_A, lda, d_b, d_x);
}
else
{
fprintf(stderr, "Error: %s is unknown function\n", opts.testFunc);
exit(EXIT_FAILURE);
}
printf("step 6: evaluate residual\n");
checkCudaErrors(cudaMemcpy(d_r, d_b, sizeof(double)*lda*rowsA, cudaMemcpyDeviceToDevice));
// r = b - A*x
checkCudaErrors(cublasDgemm_v2(
cublasHandle,
CUBLAS_OP_N,
CUBLAS_OP_N,
rowsA,
colsA,
colsA,
&minus_one,
d_A,
lda,
d_x,
rowsA,
&one,
d_r,
rowsA));
checkCudaErrors(cudaMemcpy(h_x, d_x, sizeof(double)*lda*colsA, cudaMemcpyDeviceToHost));
checkCudaErrors(cudaMemcpy(h_r, d_r, sizeof(double)*lda*rowsA, cudaMemcpyDeviceToHost));
x_inf = vec_norminf(colsA, h_x);
r_inf = vec_norminf(rowsA, h_r);
A_inf = mat_norminf(rowsA, colsA, h_A, lda);
printf("x[0] = %f\n", h_x[0]);
printf("r[0] = %f\n", h_r[0]);
printf("|b - A*x| = %E \n", r_inf);
printf("|A| = %E \n", A_inf);
printf("|x| = %E \n", x_inf);
printf("|b - A*x|/(|A|*|x|) = %E \n", r_inf/(A_inf * x_inf));
if (handle) { checkCudaErrors(cusolverDnDestroy(handle)); }
if (cublasHandle) { checkCudaErrors(cublasDestroy(cublasHandle)); }
if (stream) { checkCudaErrors(cudaStreamDestroy(stream)); }
if (h_csrValA ) { free(h_csrValA); }
if (h_csrRowPtrA) { free(h_csrRowPtrA); }
if (h_csrColIndA) { free(h_csrColIndA); }
if (h_A) { free(h_A); }
if (h_x) { free(h_x); }
if (h_b) { free(h_b); }
if (h_r) { free(h_r); }
if (d_A) { checkCudaErrors(cudaFree(d_A)); }
if (d_x) { checkCudaErrors(cudaFree(d_x)); }
if (d_b) { checkCudaErrors(cudaFree(d_b)); }
if (d_r) { checkCudaErrors(cudaFree(d_r)); }
return 0;
}
Try using two or more parallel streams (with one linear system each) on the GPU; this may help utilize a larger part of the GPU.
For timing measurements and hardware utilization, use the visual profiler instead of CPU time measurements.
Another point is that the GTX (consumer) GPUs perform quite poorly in double precision. If you have the chance, try a Tesla GPU instead.
MATLAB provides a way to call the cublas batch interface for GPU arrays using pagefun.
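One way to judge the double-precision point is to convert the question's timings into achieved GFLOP/s and compare them with the FP64 peak listed for the card. Below is a minimal host-only sketch, assuming the usual operation-count estimates for an LU-based solve and assuming the timed runs used n right-hand sides, as the posted code does. The function names are made up for illustration.

```cpp
// Approximate floating-point operation count for solving A X = B by LU:
// about 2n^3/3 for the factorization (getrf) plus about 2n^2 per
// right-hand-side column for the two triangular solves (getrs).
inline double lu_solve_flops(double n, double nrhs) {
    return 2.0 * n * n * n / 3.0 + 2.0 * n * n * nrhs;
}

// Achieved throughput in GFLOP/s for a measured wall time in seconds.
inline double achieved_gflops(double n, double nrhs, double seconds) {
    return lu_solve_flops(n, nrhs) / seconds / 1e9;
}
```

Under those assumptions, the n = 4096 timing (0.734908 s) works out to roughly 249 GFLOP/s, while the n = 64 timing (0.001157 s) is well under 1 GFLOP/s. That suggests the small cases are dominated by launch and synchronization overhead rather than arithmetic, which is the situation where running several systems concurrently has the most room to help.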

CUDA_SAFE_CALL: an illegal memory access was encountered

I am trying to do a simple matrix multiplication in CUDA. I know arrays can be flattened before passing them to the device. However, I am using cudaMallocPitch and cudaMemcpy2D to do the multiplication. When executing the code below, I get the error "an illegal memory access was encountered" when I try to copy the result back to the host. I would highly appreciate any advice on where I am going wrong. Thanks!
weights - first matrix, dim: 30x784
input - second matrix, dim: 784x100
results_d - result on the device (GPU)
result - result copied to the host
#include <stdio.h>
#include <math.h>
#include <cstdio>
#include <cstdlib>
#define CUDA_SAFE_CALL(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"CUDA_SAFE_CALL: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
__global__ void MatrixMulKernel(double *input,double *weights,double *results_d,size_t in_pitch,size_t w1_pitch,size_t result_pitch)
{
int row = threadIdx.x;
int col= threadIdx.y;
double value;
double *result_matrix;
result_matrix = ((double*)((char*)results_d + row*result_pitch + col));
printf("%d",threadIdx);
for(int i =0 ; i < in_pitch ; i++)
{
double *element1 = ((double*)((char*)input + row*in_pitch) + i) ;
double *element2 = ((double*)((char*)weights + i*w1_pitch) + col);
value =+ (*element1) * (*element2);
}
*result_matrix = value;
}
int main()
{
static double arr1[30][784];
static double arr2[784][100];
static double result[30][100];
for (int i = 0 ; i < 30; i++)
{
for(int j =0;j <784 ; j ++)
arr1[i][j] = 5;
}
for (int i =0 ; i < 784; i ++)
{
for(int j=0;j < 100 ; j++)
arr2[i][j] = 3;
}
double *input;
double *weights;
double *results_d;
size_t in_pitch,w1_pitch,result_pitch;
//allocating memory in GPU for 2 inputs and result
CUDA_SAFE_CALL(cudaMallocPitch((void**)&input,&in_pitch,100*sizeof(double),784));
CUDA_SAFE_CALL(cudaMallocPitch((void**)&weights,&w1_pitch,784*sizeof(double),30));
CUDA_SAFE_CALL(cudaMallocPitch((void**)&results_d,&result_pitch,100*sizeof(double),30));
//Copy matrix from host to device
CUDA_SAFE_CALL(cudaMemcpy2D(input,in_pitch,arr2,100*sizeof(double),100*sizeof(double),784,cudaMemcpyHostToDevice));
CUDA_SAFE_CALL(cudaMemcpy2D(weights,w1_pitch,arr1,784*sizeof(double),784*sizeof(double),30,cudaMemcpyHostToDevice));
CUDA_SAFE_CALL(cudaMemcpy2D(results_d,result_pitch,result,100*sizeof(double),100*sizeof(double),30,cudaMemcpyHostToDevice));
//using GPU
dim3 dimGrid(1,1,1);
dim3 dimBlock(32,32,1);
printf("before kernel fucntion");
MatrixMulKernel<<<dimGrid, dimBlock>>>(input, weights,results_d,in_pitch,w1_pitch,result_pitch);
printf("after kernel fucntion");
cudaThreadSynchronize();
//copying back to host
CUDA_SAFE_CALL(cudaMemcpy2D(result,result_pitch,results_d,100*sizeof(double),100*sizeof(double),30,cudaMemcpyDeviceToHost));
//printing and seeing whether the result matrix has been updated
for (int i =0 ; i < 100; i ++)
{
for(int j=0;j < 30 ; j++)
{
printf("%f",result);
}
printf("\n");
}
CUDA_SAFE_CALL(cudaFree(input));
CUDA_SAFE_CALL(cudaFree(weights));
CUDA_SAFE_CALL(cudaFree(results_d));
return 0;
}
There are a number of errors in this code. First of all, it's not clear that doing pitched allocations is going to give any benefit here. Second, if you're serious about wanting fast matrix multiply performance, you should use CUBLAS.
Issues:
You don't seem to understand pitched allocations. The pitch value returned is a value in bytes. You cannot sensibly use that for a loop index for matrix multiply. Also, the pitch value is the overall width of the pitch allocation. It does not correspond to the valid data area. For that, you should use the appropriate matrix dimension.
Your code will not do a matrix multiplication over the entire matrix area. You are only creating a single block of 32x32 threads, but you need enough blocks/threads to cover the entire matrix area. This requires changes to your grid dimensions, passing matrix dimensions to your kernel, as well as a "thread check" in your kernel to prevent out-of-bounds access.
This construct for pitched access is not correct:
result_matrix = ((double*)((char*)results_d + row*result_pitch + col));
It does not match the constructions you use for the two input matrices: the closing parenthesis is misplaced, so col is added as a byte offset rather than as an element offset.
You have the sense of your two input matrices reversed. You are indexing into the input matrix as if it were the weight matrix, and vice-versa. We need to swap the sense of row, column and i to make these match the actual matrix dimensions.
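Your final cudaMemcpy2D operation has the pitch values reversed (the destination is the host array, so its pitch is the host row width in bytes, and the source pitch is result_pitch):
cudaMemcpy2D(result,result_pitch,results_d,100*sizeof(double),100*sizeof(double),30,cudaMemcpyDeviceToHost)
                    ^^^^^^^^^^^^           ^^^^^^^^^^^^^^^^^^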
You forgot to initialize to zero your loop sum variable:
double value;
I don't know what you intended here, but it should be += rather than =+ (which assigns the value instead of accumulating it):
value =+ ...
The following code has these issues addressed, and seems to run without error for me:
$ cat t104.cu
#include <stdio.h>
#include <math.h>
#include <cstdio>
#include <cstdlib>
const int d1 = 30;
const int d2 = 784;
const int d3 = 100;
double arr1[d1][d2];
double arr2[d2][d3];
double result[d1][d3];
#define CUDA_SAFE_CALL(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"CUDA_SAFE_CALL: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
__global__ void MatrixMulKernel(double *input,double *weights,double *results_d,size_t in_pitch,size_t w1_pitch,size_t result_pitch, int dim, int rrow, int rcol)
{
int col = threadIdx.x + blockDim.x*blockIdx.x;
int row= threadIdx.y + blockDim.y*blockIdx.y;
if ((row >= rrow) || (col >= rcol)) return;
double value = 0;
double *result_matrix;
result_matrix = ((double*)((char*)results_d + row*result_pitch) + col);
for(int i =0 ; i < dim ; i++)
{
double *element1 = ((double*)((char*)input + i*in_pitch) + col) ;
double *element2 = ((double*)((char*)weights + row*w1_pitch) + i);
value += (*element1) * (*element2);
}
*result_matrix = value;
}
int main()
{
for (int i = 0 ; i < d1; i++)
{
for(int j =0;j <d2 ; j ++)
arr1[i][j] = 5;
}
for (int i =0 ; i < d2; i ++)
{
for(int j=0;j < d3 ; j++)
arr2[i][j] = 3;
}
double *input;
double *weights;
double *results_d;
size_t in_pitch,w1_pitch,result_pitch;
//allocating memory in GPU for 2 inputs and result
CUDA_SAFE_CALL(cudaMallocPitch((void**)&input,&in_pitch,d3*sizeof(double),d2));
CUDA_SAFE_CALL(cudaMallocPitch((void**)&weights,&w1_pitch,d2*sizeof(double),d1));
CUDA_SAFE_CALL(cudaMallocPitch((void**)&results_d,&result_pitch,d3*sizeof(double),d1));
//Copy matrix from host to device
CUDA_SAFE_CALL(cudaMemcpy2D(input,in_pitch,arr2,d3*sizeof(double),d3*sizeof(double),d2,cudaMemcpyHostToDevice));
CUDA_SAFE_CALL(cudaMemcpy2D(weights,w1_pitch,arr1,d2*sizeof(double),d2*sizeof(double),d1,cudaMemcpyHostToDevice));
CUDA_SAFE_CALL(cudaMemcpy2D(results_d,result_pitch,result,d3*sizeof(double),d3*sizeof(double),d1,cudaMemcpyHostToDevice));
//using GPU
dim3 dimBlock(32,32,1);
dim3 dimGrid(((d3+dimBlock.x-1)/dimBlock.x),((d1+dimBlock.y-1)/dimBlock.y),1);
MatrixMulKernel<<<dimGrid, dimBlock>>>(input, weights,results_d,in_pitch,w1_pitch,result_pitch, d2, d1, d3);
//copying back to host
CUDA_SAFE_CALL(cudaMemcpy2D(result,d3*sizeof(double),results_d,result_pitch,d3*sizeof(double),d1,cudaMemcpyDeviceToHost));
//printing and seeing whether the result matrix has been updated
for (int i =0 ; i < d3; i ++)
{
for(int j=0;j < d1 ; j++)
{
printf("%f", result[j][i]);
}
printf("\n");
}
CUDA_SAFE_CALL(cudaFree(input));
CUDA_SAFE_CALL(cudaFree(weights));
CUDA_SAFE_CALL(cudaFree(results_d));
return 0;
}
$ nvcc -arch=sm_61 -o t104 t104.cu
$
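Two points from the issue list above can be checked on the host without a GPU: that the pitch is a row stride in bytes, and that the grid has to be sized by ceiling division to cover the whole result matrix. The sketch below is an illustration only; pitched_elem and blocks_for are made-up helper names, and ordinary host memory stands in for a cudaMallocPitch allocation.

```cpp
#include <cstddef>

// Address of element (row, col) in a pitched 2D array of doubles.
// pitch_bytes is the row stride in BYTES, as returned by cudaMallocPitch,
// so the row offset is applied to a char* before casting back to double*.
inline double* pitched_elem(void* base, std::size_t pitch_bytes,
                            std::size_t row, std::size_t col) {
    char* row_start = static_cast<char*>(base) + row * pitch_bytes;
    return reinterpret_cast<double*>(row_start) + col;
}

// Blocks needed to cover `extent` elements with `block` threads per block
// (integer ceiling division, the same formula as the answer's dimGrid).
constexpr unsigned blocks_for(unsigned extent, unsigned block) {
    return (extent + block - 1) / block;
}
```

For the 30x100 result and 32x32 blocks this gives a 4x1 grid. The threads that fall beyond column 99 or row 29 are the ones the kernel's bounds check returns early.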

Optimize vector matrix multiplication in cuda with large number of zeros

I am using the following kernel to optimize vector-matrix multiplication for the case where both the vector and the matrix have a large number of zeros. Using this kernel may cut the time taken for such a multiplication to as little as half of the time taken by cublasSgemv when more than 90% of the elements are zero. But it is still much slower than an equivalent BLAS gemm host call on Ubuntu 14.04.
vec = 1 x m, mat = m x m and prod = 1 x m; all are in row-major order
m >= 5000
__global__ void calc_v_m(float *vec, float *mat, float *prod, int m)
{
int x = blockDim.x * blockIdx.x + threadIdx.x;
if(x < m)
{
prod[x] = 0;
for(int i = 0; i < m; i++)
{
int offset = i*m + x;
if( mat[offset] != 0 && vec[i] != 0 )
prod[x] += vec[i] * mat[i*m+x];
}
}
}
What can be done to further enhance the performance of this kernel apart from libraries like cuSparse?
Would be nice if this optimization was compatible with Compute Capability of 1.2
Thanks
EDIT
Corrected: prod = 1 x m
GPU = Quadro FX 1800M, Cuda v.5.0 on Ubuntu 14.04
EDIT
Complete code that performs the multiplication using (i) BLAS, (ii) cuBLAS, and (iii) the above kernel, for m = 6000. Please enter 0 when asked to enter a value.
#include <iostream>
#include <stdio.h>
#include <time.h>
#include <cblas.h>
#include <cublas_v2.h>
#include <math.h>
using namespace std;
const int m = 6000;
const int BS = 512; // threads per block
const int NB = ceil((float) m / BS); // number of blocks
__global__ void calc_v_m(float *vec, float *mat, float *prod, int m)
{
int x = blockDim.x * blockIdx.x + threadIdx.x;
if(x < m)
{
prod[x] = 0;
for(int i = 0; i < m; i++)
{
int offset = i*m + x;
if( mat[offset] != 0 && vec[i] != 0 )
prod[x] += vec[i] * mat[i*m+x];
}
}
}
int main()
{
timespec blas_start, blas_end, cublas_start, cublas_end, opt_start, opt_end;
long totalnsec; //total nano sec
double totalsec, totaltime;
int i, j;
float *A = new float[m]; // 1 x m
float *B = new float[m*m]; // m x m
float *C = new float[m]; // 1 x m
float input;
cout<<"Enter a value to populate the vector (0 to make it sparse) ";
cin>>input;
// input martix A: every 600th element is non-zero i.e 90% zero
for(i = 0; i < m; i++)
{
A[i] = input;
if( i % 600 == 0) //adjust for sparsity
A[i] = i;
}
// input matrix B: identity matrix
for(i = 0; i < m; i++)
for(j = 0; j < m; j++)
B[j*m + i] = (i==j);
//blas on host
clock_gettime(CLOCK_REALTIME, &blas_start);
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, m, m, 1.0f, A, m, B, m, 0.0f, C, m);
//cblas_sgemv(CblasRowMajor, CblasTrans, m, m, 1.0f, B, m, A, 1, 0.0f, C, 1);
clock_gettime(CLOCK_REALTIME, &blas_end);
/* for(i = 0; i < m; i++) printf("%f ", C[i]); */
//cublas section
cudaError_t cudaStat;
cublasHandle_t handle;
cublasCreate(&handle);
float *A_d, *B_d, *C_d;
cudaStat = cudaMalloc(&A_d, sizeof(float)*m);
if(cudaStat != cudaSuccess) printf("Error Allocating Memory for A_d\n");
cudaStat = cudaMalloc(&B_d, sizeof(float)*m*m);
if(cudaStat != cudaSuccess) printf("Error Allocating Memory for B_d\n");
cudaStat = cudaMalloc(&C_d, sizeof(float)*m);
if(cudaStat != cudaSuccess) printf("Error Allocating Memory for C_d\n");
cudaMemcpy(A_d, A, sizeof(float)*m, cudaMemcpyHostToDevice);
cudaMemcpy(B_d, B, sizeof(float)*m*m, cudaMemcpyHostToDevice);
float alpha = 1.0f, beta = 0.0f;
cudaDeviceSynchronize();
clock_gettime(CLOCK_REALTIME, &cublas_start);
cublasSgemv(handle, CUBLAS_OP_N, m, m, &alpha, B_d, m, A_d, 1, &beta, C_d, 1);
cudaDeviceSynchronize();
clock_gettime(CLOCK_REALTIME, &cublas_end);
cudaMemcpy(C, C_d, sizeof(float)*m, cudaMemcpyDeviceToHost);
/* for(i = 0; i < m; i++) printf("%f ", C[i]); */
// Call kernel having Optimization for Zeros
cudaDeviceSynchronize();
clock_gettime(CLOCK_REALTIME, &opt_start);
/////////////////// call kernel //////////////////
calc_v_m<<<NB, BS>>>(A_d, B_d, C_d, m);
//////////////////////////////////////////////////
cudaDeviceSynchronize();
clock_gettime(CLOCK_REALTIME, &opt_end);
cudaMemcpy(C, C_d, sizeof(float)*m, cudaMemcpyDeviceToHost);
/*for(i = 0; i < m; i++) printf("%f ", C[i]); */
// Print times
// blas time
totalsec = (double)blas_end.tv_sec - (double)blas_start.tv_sec;
totalnsec = blas_end.tv_nsec - blas_start.tv_nsec;
if(totalnsec < 0)
{
totalnsec += 1e9;
totalsec -= 1;
}
totaltime = totalsec + (double)totalnsec*1e-9;
cout<<"blas Time = "<< totaltime << "\n";
//cublas time
totalsec = (double)cublas_end.tv_sec - (double)cublas_start.tv_sec;
totalnsec = cublas_end.tv_nsec - cublas_start.tv_nsec;
if(totalnsec < 0)
{
totalnsec += 1e9;
totalsec -= 1;
}
totaltime = totalsec + (double)totalnsec*1e-9;
cout<<"cublas Time = "<< totaltime << "\n";
//Optimized Kernel Time
totalsec = (double)opt_end.tv_sec - (double)opt_start.tv_sec;
totalnsec = opt_end.tv_nsec - opt_start.tv_nsec;
if(totalnsec < 0)
{
totalnsec += 1e9;
totalsec -= 1;
}
totaltime = totalsec + (double)totalnsec*1e-9;
cout<<"Opt Kernel Time = "<< totaltime << "\n";
return 0;
}
Results
$ nvcc -arch=sm_12 blascomp.cu -o blascomp.o -lblas -lcublas
$ ./blascomp.o
Enter a value to populate the vector (0 to make it sparse) 0
blas Time = 0.000105207
cublas Time = 0.0070294
Opt Kernel Time = 0.00642797
At least on my system, blas is still the fastest for such a scenario.
Things get even more interesting if every 1200th element, instead of every 600th, is set to 0:
Enter a value to populate the vector (0 to make it sparse) 0
blas Time = 7.84e-05
cublas Time = 0.00698783
Opt Kernel Time = 0.00643042
The important thing to recognise here is that the gemv operation you are concerned with is fundamentally memory throughput limited on GPUs, rather than compute throughput limited. This implies that an "optimisation" as you have shown in your kernel:
__global__ void calc_v_m(float *vec, float *mat, float *prod, int m)
{
int x = blockDim.x * blockIdx.x + threadIdx.x;
if(x < m)
{
prod[x] = 0;
for(int i = 0; i < m; i++)
{
int offset = i*m + x;
if( mat[offset] != 0 && vec[i] != 0 )
prod[x] += vec[i] * mat[i*m+x];
}
}
}
isn't really an optimisation at all, simply because the memory transactions are the performance bottleneck in the kernel, not the floating point arithmetic. Your code must perform most of the memory transactions regardless of whether zero detection ends up skipping the multiply-add.
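To put a rough number on "memory throughput limited", here is a back-of-the-envelope sketch in host-side C++. It assumes that every matrix element, the input vector and the output vector each cross the memory bus exactly once; the function name `gemv_flops_per_byte` is made up for illustration.

```cpp
#include <cstddef>

// Rough arithmetic intensity (FLOPs per byte of global memory traffic) of a
// dense single-precision m x m matrix-vector product. Assumption: the matrix,
// the input vector and the output vector are each transferred exactly once.
double gemv_flops_per_byte(std::size_t m)
{
    const double n     = static_cast<double>(m);
    const double flops = 2.0 * n * n;                       // one multiply and one add per matrix element
    const double bytes = sizeof(float) * (n * n + 2.0 * n); // matrix + input vector + output vector
    return flops / bytes;
}
```

For any reasonably large m this comes out just under 0.5 FLOP per byte, so each 4-byte matrix element fetched feeds only two floating point operations. Skipping those two operations saves very little when the fetch still has to happen.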
Consider the following, instrumented version of roughly the same code:
__constant__ float cvec1[2];
__global__ void
__launch_bounds__(512,4)
calc_v_m1(const float* __restrict__ vec,
const float* __restrict__ mat,
float* __restrict__ prod,
int m,
int do_reads = 1,
int do_write = 1)
{
int x = blockDim.x * blockIdx.x + threadIdx.x;
if(x < m)
{
float res = 0;
float mval = cvec1[0], vval = cvec1[1];
#pragma unroll 8
for(int i = 0; i < m; i++)
{
int offset = i*m + x;
if (do_reads) {
mval = mat[offset];
vval = vec[i];
}
res += mval * vval;
}
if (do_write) prod[x] = res;
}
}
Here I have added two optional arguments which control whether the kernel will load from global memory, and whether the kernel will store to global memory. This allows me to quantify the performance impact of the memory loads, computation, and memory stores independently. The results using your test code are instructive:
Function                             nvprof time
-------------------------------------------------
cublasSgemv                          942.75us
calc_v_m                             2798.4us
calc_v_m1(do_reads=1, do_write=1)    962.40us
calc_v_m1(do_reads=1, do_write=0)    970.40us
calc_v_m1(do_reads=0, do_write=1)    55.166us
calc_v_m1(do_reads=0, do_write=0)    55.102us
[All benchmarking done on a GTX970 using the CUDA 7.5 release toolchain and CUBLAS 7.5 library]
In no particular order:
- The full instrumented kernel runtime is within a few percent of the equivalent CUBLAS call.
- The memory fetches from global memory are the bottleneck.
- The actual computations in the kernel only constitute 5% of the kernel running time.
- The "fire-and-forget" nature of write operations in CUDA means that the latency of the write has no significant effect on throughput.
- Your "optimised" kernel is considerably slower than either CUBLAS or the instrumented kernel, probably because all you are introducing is branch divergence without addressing the source of the kernel bottleneck (the latency of the memory loads).
The only times conditionally executing the FMAD operation makes sense would be in an architecture where memory has near zero latency and floating point throughput was severely constrained. The GPU definitely doesn't fall into that category.
The only other option for optimising this would be to exploit a priori information about the sparsity patterns in the LHS matrix to remove the need to read zero entries. Which is precisely what sparse matrix formats and linear algebra codes are designed to accommodate.
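As an illustration of what such a format looks like, here is a minimal host-side sketch of compressed sparse row (CSR) storage and the matching matrix-vector product. The struct and function names are hypothetical, and this is a CPU reference rather than a tuned GPU kernel; libraries such as cuSPARSE provide GPU implementations of the same idea.

```cpp
#include <cstddef>
#include <vector>

// Compressed sparse row (CSR) storage: only non-zero entries are kept.
struct CsrMatrix {
    std::size_t rows;
    std::vector<std::size_t> row_ptr;  // size rows + 1; row r occupies [row_ptr[r], row_ptr[r + 1])
    std::vector<std::size_t> col_idx;  // column index of each stored value
    std::vector<float> values;         // the non-zero values themselves
};

// y = A * x, touching only the stored (non-zero) entries of A.
std::vector<float> csr_spmv(const CsrMatrix& A, const std::vector<float>& x)
{
    std::vector<float> y(A.rows, 0.0f);
    for (std::size_t r = 0; r < A.rows; ++r) {
        float sum = 0.0f;
        for (std::size_t k = A.row_ptr[r]; k < A.row_ptr[r + 1]; ++k)
            sum += A.values[k] * x[A.col_idx[k]];
        y[r] = sum;
    }
    return y;
}
```

The point of the layout is that zero entries are never stored, so they are never read. That removes memory traffic, which a zero test inside a dense kernel cannot do.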

2D Convolution Incorrect Results Cuda Constant Memory

I'm struggling with the kernel code. I have updated the question to include the support files, but those were provided and should be correct.
This is one of my first GPU programs and I've spent several hours trying new things and I can't seem to get this right. It is compiling and running, but the results are incorrect.
I am basically having trouble understanding what exactly I need to be doing differently because this kernel is giving incorrect results. I'm trying to load a tile of the input image to shared memory (Ns[][], which I think I've done correctly) and apply the filter on the input image tile (which I am struggling with).
I would greatly appreciate it if someone who is more experienced could assist me in figuring out exactly where I've gone wrong and give me an idea how to resolve the issue. I appreciate your time and apologies if I've asked this question incorrectly.
main.cu:
#include <stdio.h>
#include "support.h"
#include "kernel.cu"
#include <time.h>
int main(int argc, char* argv[]){
Timer timer;
time_t t;
// Initialize host variables ----------------------------------------------
printf("\nSetting up the problem..."); fflush(stdout);
startTime(&timer);
Matrix M_h, N_h, P_h; // M: filter, N: input image, P: output image
Matrix N_d, P_d;
unsigned imageHeight, imageWidth;
cudaError_t cuda_ret;
dim3 dim_grid, dim_block;
/* Read image dimensions */
if (argc == 1) {
imageHeight = 600;
imageWidth = 1000;
} else if (argc == 2) {
imageHeight = atoi(argv[1]);
imageWidth = atoi(argv[1]);
} else if (argc == 3) {
imageHeight = atoi(argv[1]);
imageWidth = atoi(argv[2]);
} else {
printf("\n Invalid input parameters!"
"\n Usage: ./convolution # Image is 600 x 1000"
"\n Usage: ./convolution <m> # Image is m x m"
"\n Usage: ./convolution <m> <n> # Image is m x n"
"\n");
exit(0);
}
/* Allocate host memory */
M_h = allocateMatrix(FILTER_SIZE, FILTER_SIZE);
N_h = allocateMatrix(imageHeight, imageWidth);
P_h = allocateMatrix(imageHeight, imageWidth);
/* Initialize filter and images */
initMatrix(M_h);
initMatrix(N_h);
stopTime(&timer); printf("%f s\n", elapsedTime(timer));
printf(" Image: %u x %u\n", imageHeight, imageWidth);
printf(" Mask: %u x %u\n", FILTER_SIZE, FILTER_SIZE);
// Allocate device variables ----------------------------------------------
printf("Allocating device variables..."); fflush(stdout);
startTime(&timer);
N_d = allocateDeviceMatrix(imageHeight, imageWidth);
P_d = allocateDeviceMatrix(imageHeight, imageWidth);
cudaDeviceSynchronize();
stopTime(&timer); printf("%f s\n", elapsedTime(timer));
// Copy host variables to device ------------------------------------------
printf("Copying data from host to device..."); fflush(stdout);
startTime(&timer);
/* Copy image to device global memory */
copyToDeviceMatrix(N_d, N_h);
cudaMemcpyToSymbol(M_h, M_c,FILTER_SIZE*sizeof(float));
dim_grid = dim3(((N_h.width / BLOCK_SIZE) + 1), ((N_h.height / BLOCK_SIZE) + 1));
dim_block = dim3(BLOCK_SIZE, BLOCK_SIZE);
cudaDeviceSynchronize();
stopTime(&timer); printf("%f s\n", elapsedTime(timer));
// Launch kernel ----------------------------------------------------------
printf("Launching kernel..."); fflush(stdout);
startTime(&timer);
convolution<<<dim_grid, dim_block>>>(N_d, P_d);
cuda_ret = cudaDeviceSynchronize();
if(cuda_ret != cudaSuccess) FATAL("Unable to launch/execute kernel");
cudaDeviceSynchronize();
stopTime(&timer); printf("%f s\n", elapsedTime(timer));
// Copy device variables from host ----------------------------------------
printf("Copying data from device to host..."); fflush(stdout);
startTime(&timer);
copyFromDeviceMatrix(P_h, P_d);
cudaDeviceSynchronize();
stopTime(&timer); printf("%f s\n", elapsedTime(timer));
// Verify correctness -----------------------------------------------------
printf("Verifying results..."); fflush(stdout);
verify(M_h, N_h, P_h);
// Free memory ------------------------------------------------------------
freeMatrix(M_h);
freeMatrix(N_h);
freeMatrix(P_h);
freeDeviceMatrix(N_d);
freeDeviceMatrix(P_d);
return 0;
}
kernel.cu:
__constant__ float M_c[FILTER_SIZE][FILTER_SIZE];
__global__ void convolution(Matrix N, Matrix P){
__shared__ float Ns[TILE_SIZE + 5 - 1][TILE_SIZE + 5 -1];
int i, j;
float output = 0.0f;
int tx = threadIdx.x;
int ty = threadIdx.y;
int row_o = blockIdx.y * TILE_SIZE + ty;
int col_o = blockIdx.x * TILE_SIZE + tx;
int row_i = row_o - 2;
int col_i = col_o - 2;
if((row_i >= 0) && (row_i < N.height) && (col_i >= 0) && (col_i < N.width)){
Ns[ty][tx] = N.elements[row_i * N.width + col_i];
}
else{
Ns[ty][tx] = 0.0f;
}
__syncthreads();
if(ty < TILE_SIZE && tx < TILE_SIZE){
for(i = 0; i < 5; i++){
for(j = 0; j < 5; j++){
output += M_c[i][j] * Ns[i + ty][j + tx];
}
}
}
if(row_o < P.height && col_o < P.width){
P.elements[row_o * P.width + col_o] = output;
}
}
support.h:
#ifndef __FILEH__
#define __FILEH__
#include <sys/time.h>
typedef struct {
struct timeval startTime;
struct timeval endTime;
} Timer;
// Matrix Structure declaration
typedef struct {
unsigned int width;
unsigned int height;
unsigned int pitch;
float* elements;
} Matrix;
#define FILTER_SIZE 5
#define TILE_SIZE 12
#define BLOCK_SIZE (TILE_SIZE + FILTER_SIZE - 1)
Matrix allocateMatrix(unsigned height, unsigned width);
void initMatrix(Matrix mat);
Matrix allocateDeviceMatrix(unsigned height, unsigned width);
void copyToDeviceMatrix(Matrix dst, Matrix src);
void copyFromDeviceMatrix(Matrix dst, Matrix src);
void verify(Matrix M, Matrix N, Matrix P);
void freeMatrix(Matrix mat);
void freeDeviceMatrix(Matrix mat);
void startTime(Timer* timer);
void stopTime(Timer* timer);
float elapsedTime(Timer timer);
#define FATAL(msg, ...) \
do {\
fprintf(stderr, "[%s:%d] "msg"\n", __FILE__, __LINE__, ##__VA_ARGS__);\
exit(-1);\
} while(0)
#if __BYTE_ORDER != __LITTLE_ENDIAN
# error "File I/O is not implemented for this system: wrong endianness."
#endif
#endif
support.cu:
#include <stdlib.h>
#include <stdio.h>
#include "support.h"
Matrix allocateMatrix(unsigned height, unsigned width)
{
Matrix mat;
mat.height = height;
mat.width = mat.pitch = width;
mat.elements = (float*)malloc(height*width*sizeof(float));
if(mat.elements == NULL) FATAL("Unable to allocate host");
return mat;
}
void initMatrix(Matrix mat)
{
for (unsigned int i=0; i < mat.height*mat.width; i++) {
mat.elements[i] = (rand()%100)/100.00;
}
}
Matrix allocateDeviceMatrix(unsigned height, unsigned width)
{
Matrix mat;
cudaError_t cuda_ret;
mat.height = height;
mat.width = mat.pitch = width;
cuda_ret = cudaMalloc((void**)&(mat.elements), height*width*sizeof(float));
if(cuda_ret != cudaSuccess) FATAL("Unable to allocate device memory");
return mat;
}
void copyToDeviceMatrix(Matrix dst, Matrix src)
{
cudaError_t cuda_ret;
cuda_ret = cudaMemcpy(dst.elements, src.elements, src.height*src.width*sizeof(float), cudaMemcpyHostToDevice);
if(cuda_ret != cudaSuccess) FATAL("Unable to copy to device");
}
void copyFromDeviceMatrix(Matrix dst, Matrix src)
{
cudaError_t cuda_ret;
cuda_ret = cudaMemcpy(dst.elements, src.elements, src.height*src.width*sizeof(float), cudaMemcpyDeviceToHost);
if(cuda_ret != cudaSuccess) FATAL("Unable to copy from device");
}
void verify(Matrix M, Matrix N, Matrix P) {
const float relativeTolerance = 1e-6;
for(int row = 0; row < N.height; ++row) {
for(int col = 0; col < N.width; ++col) {
float sum = 0.0f;
for(int i = 0; i < M.height; ++i) {
for(int j = 0; j < M.width; ++j) {
int iN = row - M.height/2 + i;
int jN = col - M.width/2 + j;
if(iN >= 0 && iN < N.height && jN >= 0 && jN < N.width) {
sum += M.elements[i*M.width + j]*N.elements[iN*N.width + jN];
}
}
}
float relativeError = (sum - P.elements[row*P.width + col])/sum;
if (relativeError > relativeTolerance
|| relativeError < -relativeTolerance) {
printf("TEST FAILED\n\n");
exit(0);
}
}
}
printf("TEST PASSED\n\n");
}
void freeMatrix(Matrix mat)
{
free(mat.elements);
mat.elements = NULL;
}
void freeDeviceMatrix(Matrix mat)
{
cudaFree(mat.elements);
mat.elements = NULL;
}
void startTime(Timer* timer) {
gettimeofday(&(timer->startTime), NULL);
}
void stopTime(Timer* timer) {
gettimeofday(&(timer->endTime), NULL);
}
float elapsedTime(Timer timer) {
return ((float) ((timer.endTime.tv_sec - timer.startTime.tv_sec) \
+ (timer.endTime.tv_usec - timer.startTime.tv_usec)/1.0e6));
}
One set of problems is here:
cudaMemcpyToSymbol(M_h, M_c,FILTER_SIZE*sizeof(float));
If you ran your code with cuda-memcheck it would point you right at this line as being a problem.
The first parameter should be the destination symbol, i.e. M_c, and the second parameter should be the host source pointer, i.e. M_h.
Furthermore, shouldn't it be FILTER_SIZE*FILTER_SIZE ? Isn't the size of data you want to transfer equal to the dimension squared?
Finally, M_h is not a valid source pointer. You should use M_h.elements.
So something like this:
cudaMemcpyToSymbol(M_c, M_h.elements,FILTER_SIZE*FILTER_SIZE*sizeof(float));
I don't believe this fixes all the issues in your code. To continue the debug, I would print out one element in the GPU result that does not match your verify routine, and work through the arithmetic for that one element. Use printf in device code if that helps.
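One candidate for a remaining issue, judging only from the posted code and not verified on hardware, is the grid size. The kernel advances by TILE_SIZE outputs per block, while main.cu sizes the grid with BLOCK_SIZE. The host-side sketch below just works through that arithmetic; the three helper names are hypothetical.

```cpp
// Constants copied from support.h in the question.
const unsigned FILTER_SIZE = 5;
const unsigned TILE_SIZE   = 12;
const unsigned BLOCK_SIZE  = TILE_SIZE + FILTER_SIZE - 1;   // 16

// Blocks along one dimension as computed in the posted main.cu.
unsigned blocks_as_posted(unsigned extent) { return extent / BLOCK_SIZE + 1; }

// Blocks needed if each block produces TILE_SIZE outputs (ceiling division).
unsigned blocks_needed(unsigned extent) { return (extent + TILE_SIZE - 1) / TILE_SIZE; }

// Output pixels covered along one dimension by a given number of blocks.
unsigned outputs_covered(unsigned blocks) { return blocks * TILE_SIZE; }
```

For the default 1000-pixel width, the posted formula gives 63 blocks covering 756 output columns, whereas 84 blocks are needed. Separately, the final write in the kernel is guarded only by the image bounds, so threads with tx or ty at or above TILE_SIZE still store their untouched output of 0.0f into pixels that belong to a neighbouring block's tile. Moving that store inside the ty < TILE_SIZE && tx < TILE_SIZE branch looks like the natural fix, though that is an inference from reading the code rather than something the answer states.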
In the future, please run your code with cuda-memcheck before asking for help here. Even if you don't understand the output, it will be useful for those trying to help you.

CUDA n body shared memory does not speed up the computation

I am quite new to CUDA. I wrote a short code ONLY for testing the kernel for computing accelerations of mass particles. I test it using only time ./example. I have Kubuntu 12.04, Intel(R) Core(TM) i5 CPU 760 @ 2.80GHz, GeForce GTX 560, and compile it using nvcc -O3 -arch=sm_20 -o example example.cu. Here is my code.
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
__global__ void acc_sh(double *x, double *y, double *z, double *ax, double *ay, double *az, double *mass, int N)
{
extern __shared__ double4 shPos[]; //make dynamic
int p = blockDim.x;
int idx = blockIdx.x*p + threadIdx.x;
if (idx > N-1) return;
double3 acc = (double3){0.0,0.0,0.0};
double posx = x[idx];
double posy = y[idx];
double posz = z[idx];
// Tile
for (int k = 0; k < N; k += p) {
//Load positions into shmem
shPos[threadIdx.x].x = x[k + threadIdx.x];
shPos[threadIdx.x].y = y[k + threadIdx.x];
shPos[threadIdx.x].z = z[k + threadIdx.x];
shPos[threadIdx.x].w = mass[k + threadIdx.x];
__syncthreads();
for (int j = 0; j < p && k + j < N; j++) {
//Loop over the shmem
double rijx = posx - shPos[j].x;
double rijy = posy - shPos[j].y;
double rijz = posz - shPos[j].z;
double dist = rijx*rijx + rijy*rijy + rijz*rijz;
double dist3 = dist*dist*dist;
double apre = 0.0;
if (dist3 != 0) //avoid self-interaction
{
apre = rsqrt(dist3)*shPos[j].w;
}
acc.x += apre*rijx;
acc.y += apre*rijy;
acc.z += apre*rijz;
}
__syncthreads();
}
ax[idx] = acc.x;
ay[idx] = acc.y;
az[idx] = acc.z;
}
__global__ void acc(double *x, double *y, double *z, double *ax, double *ay, double *az, double *mass, int N)
{
int p = blockDim.x;
int idx = blockIdx.x*p + threadIdx.x;
if (idx > N-1) return;
double3 acc = (double3){0.0,0.0,0.0};
double posx = x[idx];
double posy = y[idx];
double posz = z[idx];
// Do not use shmem and loop over all bodies
for (int k = 0; k < N; k++) {
double rijx = posx - x[k];
double rijy = posy - y[k];
double rijz = posz - y[k];
double dist = rijx*rijx + rijy*rijy + rijz*rijz;
double dist3 = dist*dist*dist;
double apre = 0.0;
if (dist3 != 0) //avoid self-interaction
{
apre = rsqrt(dist3)*mass[k];
}
acc.x += apre*rijx;
acc.y += apre*rijy;
acc.z += apre*rijz;
__syncthreads();
}
ax[idx] = acc.x;
ay[idx] = acc.y;
az[idx] = acc.z;
}
int main()
{
srand(time(NULL));
const int N = 16384;
double t, dt, tend;
//INIT TEST PARTICLES
// HOST
double *x, *y, *z, *mass;
double *ax, *ay, *az, *dmass;
//DEVICE
double *dx, *dy, *dz;
double *dax, *day, *daz;
double size = N*sizeof(double);
cudaMalloc((void**)&dx, size);
cudaMalloc((void**)&dy, size);
cudaMalloc((void**)&dz, size);
cudaMalloc((void**)&dmass, size);
cudaMalloc((void**)&dax, size);
cudaMalloc((void**)&day, size);
cudaMalloc((void**)&daz, size);
x = (double*) malloc(size);
y = (double*) malloc(size);
z = (double*) malloc(size);
mass = (double*) malloc(size);
ax = (double*) malloc(size);
ay = (double*) malloc(size);
az = (double*) malloc(size);
for (int i = 0; i < N; i++)
{
x[i] = (double) rand()/RAND_MAX;
y[i] = (double) rand()/RAND_MAX;
z[i] = (double) rand()/RAND_MAX;
mass[i] = (double) rand()/RAND_MAX;
// printf("%d %10.5e %10.5e %10.5e %10.5e \n", i, x[i], y[i], z[i], mass[i]);
ax[i] = 0;
ay[i] = 0;
az[i] = 0;
}
cudaMemcpy(dx, x, size, cudaMemcpyHostToDevice);
cudaMemcpy(dy, y, size, cudaMemcpyHostToDevice);
cudaMemcpy(dz, z, size, cudaMemcpyHostToDevice);
cudaMemcpy(dmass, mass, size, cudaMemcpyHostToDevice);
cudaMemcpy(dax, ax, size, cudaMemcpyHostToDevice);
cudaMemcpy(day, ay, size, cudaMemcpyHostToDevice);
cudaMemcpy(daz, az, size, cudaMemcpyHostToDevice);
t = 0.0; //start integ. time
tend = 365.0; //end integr. time, about one year
dt = 1.0;
int TPB = 128;
int BPG = (N/TPB)+1;
//********************************************************
//********************************************************
//********************************************************
//MAIN CYCLE**********************************************
//********************************************************
//********************************************************
//********************************************************
while (t <= tend) {
printf("time [d] %24.20f \n", t);
acc_sh<<< BPG, TPB, sizeof(double4)*TPB >>>(dx,dy,dz,dax,day,daz,dmass,N);
//acc<<< BPG, TPB >>>(dx,dy,dz,dax,day,daz,dmass,N);
t += dt;
}
cudaMemcpy(x, dx, size, cudaMemcpyDeviceToHost);
cudaMemcpy(y, dy, size, cudaMemcpyDeviceToHost);
cudaMemcpy(z, dz, size, cudaMemcpyDeviceToHost);
cudaMemcpy(ax, dax, size, cudaMemcpyDeviceToHost);
cudaMemcpy(ay, day, size, cudaMemcpyDeviceToHost);
cudaMemcpy(az, daz, size, cudaMemcpyDeviceToHost);
//********************************************************
//********************************************************
//********************************************************
//OUTPUT RESULTS******************************************
//********************************************************
//********************************************************
//********************************************************
/*for (int j = 0; j < N; j++) {
printf("%d %23.16e %23.16e %23.16e \n", j+1, ax[j], ay[j], az[j]);
}*/
cudaFree(dx);
cudaFree(dy);
cudaFree(dz);
cudaFree(ax);
cudaFree(ay);
cudaFree(az);
return 0;
}
When I run it and measure the total time of app running, I obtain these running times:
NO SHARED (in MAIN CYCLE only acc_sh is commented):
real 0m44.933s
user 0m32.838s
sys 0m12.001s
SHARED (in MAIN CYCLE only acc is commented):
real 0m44.259s
user 0m32.710s
sys 0m11.445s
Times are comparable! Why? I expected that acc_sh, which uses shared memory, would be faster... Next question is: why is the program at the beginning so fast, and at the tend it waits for "something"?
Don't use a double quantity to specify the number of bytes to allocate or transfer:
double size = N*sizeof(double);
Use int, unsigned, or size_t instead. When I compile your code, I see numerous warnings due to this.
You have a bug in your acc kernel code which will produce incorrect results and affect the timing:
double rijy = posy - y[k];
double rijz = posz - y[k];
The second of these lines should use z[k], not y[k].
This coding error significantly reduces the amount of data that your non-shared kernel needs to load, which makes this kernel (incorrectly) perform better. If you had bothered to compare and check the results between the two cases, you would have found a discrepancy there as well.
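A comparison of that kind can be very small. The following host-side sketch assumes the acceleration arrays from each kernel have already been copied back into separate host buffers; the helper name `max_relative_diff` is hypothetical.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper: largest relative difference between two result arrays,
// for example ax as produced by acc and ax as produced by acc_sh.
double max_relative_diff(const double* a, const double* b, std::size_t n)
{
    double worst = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double scale = std::fmax(std::fabs(a[i]), std::fabs(b[i]));
        if (scale == 0.0) continue;          // both values are exactly zero
        const double diff = std::fabs(a[i] - b[i]) / scale;
        if (diff > worst) worst = diff;
    }
    return worst;
}
```

Two correct kernels that sum in a different order should agree to within rounding error. A bug such as reading y[k] where z[k] was intended would show up as a far larger difference.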
When I fix those errors, on my particular setup, I get timings of ~21 seconds for the non-shared case, and ~18 seconds for the shared case.
If you're looking for 10x improvement going from global to shared memory, that's simply implausible. Shared memory bandwidth is only about 5x better than global memory bandwidth, so it's unreasonable to expect 10x even in a perfect case. Furthermore, this type of comparison discounts the effect of the L1 and L2 caches in your GPU, which can bring global memory accesses, for frequently accessed data, up to nearly the level of shared memory.
Regarding this question: "why is the program at the beginning so fast, and at the tend it waits for "something"?" The kernel launches are asynchronous. The kernel launch returns control to the host thread before the kernel begins executing. When you launch a kernel in a loop like this, it launches, and then immediately returns control to the host thread (before that kernel begins executing), which launches the next kernel.