multiple SVDs of a matrix using CUDA [duplicate] - cuda

I am using cusolverDnCgesvdjBatched function to calculate singular value decomposition (SVD) of multiple matrices, I use cuda-memcheck to check any memory issues, I am getting an error like this in the cusolverDnCgesvdjBatched function.
========= Invalid __global__ write of size 4
========= at 0x000062f8 in void batched_svd_parallel_jacobi_32x16<float2, float>(int, int, int, int, float2*, __int64, int, float*, float2*, __int64, int, float2*, __int64, int, float, int, int*, float, int, int*, int, float)
========= by thread (0,0,0) in block (4,0,0)
========= Address 0x701019010 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host
========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to "unspecified launch failure" on CUDA API call to cuModuleUnload.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nvami.inf_amd64_72390dc4652f28fa\nvcuda64.dll (cuProfilerStop + 0x904ce) [0x2ae05e]
========= Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nvami.inf_amd64_72390dc4652f28fa\nvcuda64.dll (cuProfilerStop + 0x92e73) [0x2b0a03]
========= Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nvami.inf_amd64_72390dc4652f28fa\nvcuda64.dll [0x84cb7]
========= Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nvami.inf_amd64_72390dc4652f28fa\nvcuda64.dll [0x86e03]
========= Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nvami.inf_amd64_72390dc4652f28fa\nvcuda64.dll (cuProfilerStop + 0x11473a) [0x3322ca]
========= Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nvami.inf_amd64_72390dc4652f28fa\nvcuda64.dll (cuModuleUnload + 0x1d6) [0x1d5d36]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (cudart::module::unload + 0x115) [0x9535]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (cudart::contextState::unloadAllModules + 0x196) [0x9b36]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (cudart::contextStateManager::destroyAllContextStatesOnRuntimeUnload + 0x78) [0xa188]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (cudart::globalState::~globalState + 0x3d) [0x24dd]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (cudart::set<cudart::globalModule * __ptr64>::rehash + 0x106) [0x74c6]
========= Host Frame:C:\WINDOWS\System32\ucrtbase.dll (execute_onexit_table + 0x156) [0x142d6]
========= Host Frame:C:\WINDOWS\System32\ucrtbase.dll (execute_onexit_table + 0x7b) [0x141fb]
========= Host Frame:C:\WINDOWS\System32\ucrtbase.dll (execute_onexit_table + 0x34) [0x141b4]
========= Host Frame:C:\WINDOWS\System32\ucrtbase.dll (exit + 0x142) [0x20522]
========= Host Frame:C:\WINDOWS\System32\ucrtbase.dll (exit + 0xcb) [0x204ab]
========= Host Frame:C:\WINDOWS\System32\ucrtbase.dll (exit + 0x6e) [0x2044e]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (gpuErrchk + 0x4c) [0xf0dc]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (main + 0x3ef) [0xebaf]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (__scrt_common_main_seh + 0x10c) [0xf5c4]
========= Host Frame:C:\WINDOWS\System32\KERNEL32.dll (BaseThreadInitThunk + 0x14) [0x17034]
========= Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x52651]
=========
========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to "unspecified launch failure" on CUDA API call to cuModuleUnload.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nvami.inf_amd64_72390dc4652f28fa\nvcuda64.dll (cuProfilerStop + 0x904ce) [0x2ae05e]
========= Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nvami.inf_amd64_72390dc4652f28fa\nvcuda64.dll (cuProfilerStop + 0x92e73) [0x2b0a03]
========= Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nvami.inf_amd64_72390dc4652f28fa\nvcuda64.dll [0x84cb7]
========= Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nvami.inf_amd64_72390dc4652f28fa\nvcuda64.dll [0x86e03]
========= Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nvami.inf_amd64_72390dc4652f28fa\nvcuda64.dll (cuProfilerStop + 0x11473a) [0x3322ca]
========= Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nvami.inf_amd64_72390dc4652f28fa\nvcuda64.dll (cuModuleUnload + 0x1d6) [0x1d5d36]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (cudart::module::unload + 0x115) [0x9535]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (cudart::contextState::unloadAllModules + 0x196) [0x9b36]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (cudart::contextStateManager::destroyAllContextStatesOnRuntimeUnload + 0x78) [0xa188]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (cudart::globalState::~globalState + 0x3d) [0x24dd]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (cudart::set<cudart::globalModule * __ptr64>::rehash + 0x106) [0x74c6]
========= Host Frame:C:\WINDOWS\System32\ucrtbase.dll (execute_onexit_table + 0x156) [0x142d6]
========= Host Frame:C:\WINDOWS\System32\ucrtbase.dll (execute_onexit_table + 0x7b) [0x141fb]
========= Host Frame:C:\WINDOWS\System32\ucrtbase.dll (execute_onexit_table + 0x34) [0x141b4]
========= Host Frame:C:\WINDOWS\System32\ucrtbase.dll (exit + 0x142) [0x20522]
========= Host Frame:C:\WINDOWS\System32\ucrtbase.dll (exit + 0xcb) [0x204ab]
========= Host Frame:C:\WINDOWS\System32\ucrtbase.dll (exit + 0x6e) [0x2044e]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (gpuErrchk + 0x4c) [0xf0dc]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (main + 0x3ef) [0xebaf]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (__scrt_common_main_seh + 0x10c) [0xf5c4]
========= Host Frame:C:\WINDOWS\System32\KERNEL32.dll (BaseThreadInitThunk + 0x14) [0x17034]
========= Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x52651]
=========
========= ERROR SUMMARY: 8 errors
I am attaching the whole code I am using.
kernel.cu
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <cuda_runtime.h>
#include <cusolverDn.h>
#include "Utilities.cuh"
#include "TimingGPU.cuh"
#define FULLSVD
#define PRINTRESULTS
/********/
/* MAIN */
/********/
int main() {
const int M = 10;
const int N = 5;
const int lda = M;
//const int numMatrices = 3;
const int numMatrices = 256;
TimingGPU timerGPU;
// --- Setting the host matrix
cuComplex *h_A = (cuComplex *)malloc(lda * N * numMatrices * sizeof(double));
for (unsigned int k = 0; k < numMatrices; k++)
for (unsigned int i = 0; i < M; i++)
{
for (unsigned int j = 0; j < N; j++)
{
h_A[k * M * N + j * M + i] = make_float2((1. / (k + 1)) * (i + j * j) * (i + j), (1. / (k + 1)) * (i + j * j) * (i + j));
//printf("[%d, %d] %f\n", i, j, h_A[j*M + i]);
//printf("%f %f", h_A[j*M + i].x, h_A[j * M + i].y);
}
//printf("\n");
}
// --- Setting the device matrix and moving the host matrix to the device
cuComplex *d_A; gpuErrchk(cudaMalloc(&d_A, M * N * numMatrices * sizeof(cuComplex)));
gpuErrchk(cudaMemcpy(d_A, h_A, M * N * numMatrices * sizeof(cuComplex), cudaMemcpyHostToDevice));
// --- host side SVD results space
float *h_S = (float *)malloc(N * numMatrices * sizeof(float));
cuComplex *h_U = NULL;
cuComplex *h_V = NULL;
#ifdef FULLSVD
h_U = (cuComplex *)malloc(M * M * numMatrices * sizeof(cuComplex));
h_V = (cuComplex *)malloc(N * N * numMatrices * sizeof(cuComplex));
#endif
// --- device side SVD workspace and matrices
int work_size = 0;
int *devInfo; gpuErrchk(cudaMalloc(&devInfo, sizeof(int)));
float *d_S; gpuErrchk(cudaMalloc(&d_S, N * numMatrices * sizeof(float)));
cuComplex *d_U = NULL;
cuComplex *d_V = NULL;
#ifdef FULLSVD
gpuErrchk(cudaMalloc(&d_U, M * M * numMatrices * sizeof(cuComplex)));
gpuErrchk(cudaMalloc(&d_V, N * N * numMatrices * sizeof(cuComplex)));
#endif
cuComplex *d_work = NULL; /* devie workspace for gesvdj */
int devInfo_h = 0; /* host copy of error devInfo_h */
// --- Parameters configuration of Jacobi-based SVD
const double tol = 1.e-7;
const int maxSweeps = 15;
cusolverEigMode_t jobz; // --- CUSOLVER_EIG_MODE_VECTOR - Compute eigenvectors; CUSOLVER_EIG_MODE_NOVECTOR - Compute singular values only
#ifdef FULLSVD
jobz = CUSOLVER_EIG_MODE_VECTOR;
#else
jobz = CUSOLVER_EIG_MODE_NOVECTOR;
#endif
const int econ = 0; // --- econ = 1 for economy size
// --- Numerical result parameters of gesvdj
double residual = 0;
int executedSweeps = 0;
// --- CUDA solver initialization
cusolverDnHandle_t solver_handle = NULL;
cusolveSafeCall(cusolverDnCreate(&solver_handle));
// --- Configuration of gesvdj
gesvdjInfo_t gesvdj_params = NULL;
cusolveSafeCall(cusolverDnCreateGesvdjInfo(&gesvdj_params));
// --- Set the computation tolerance, since the default tolerance is machine precision
cusolveSafeCall(cusolverDnXgesvdjSetTolerance(gesvdj_params, tol));
// --- Set the maximum number of sweeps, since the default value of max. sweeps is 100
cusolveSafeCall(cusolverDnXgesvdjSetMaxSweeps(gesvdj_params, maxSweeps));
// --- Query the SVD workspace
cusolveSafeCall(cusolverDnCgesvdjBatched_bufferSize(
solver_handle,
jobz, // --- Compute the singular vectors or not
M, // --- Number of rows of A, 0 <= M
N, // --- Number of columns of A, 0 <= N
d_A, // --- M x N
lda, // --- Leading dimension of A
d_S, // --- Square matrix of size min(M, N) x min(M, N)
d_U, // --- M x M if econ = 0, M x min(M, N) if econ = 1
lda, // --- Leading dimension of U, ldu >= max(1, M)
d_V, // --- N x N if econ = 0, N x min(M,N) if econ = 1
lda, // --- Leading dimension of V, ldv >= max(1, N)
&work_size,
gesvdj_params,
numMatrices));
gpuErrchk(cudaMalloc(&d_work, sizeof(cuComplex) * work_size));
// --- Compute SVD
timerGPU.StartCounter();
cusolveSafeCall(cusolverDnCgesvdjBatched(
solver_handle,
jobz, // --- Compute the singular vectors or not
M, // --- Number of rows of A, 0 <= M
N, // --- Number of columns of A, 0 <= N
d_A, // --- M x N
lda, // --- Leading dimension of A
d_S, // --- Square matrix of size min(M, N) x min(M, N)
d_U, // --- M x M if econ = 0, M x min(M, N) if econ = 1
lda, // --- Leading dimension of U, ldu >= max(1, M)
d_V, // --- N x N if econ = 0, N x min(M, N) if econ = 1
N, // --- Leading dimension of V, ldv >= max(1, N)
d_work,
work_size,
devInfo,
gesvdj_params,
numMatrices));
printf("Calculation of the singular values only: %f ms\n\n", timerGPU.GetCounter());
gpuErrchk(cudaMemcpy(&devInfo_h, devInfo, sizeof(int), cudaMemcpyDeviceToHost));
gpuErrchk(cudaMemcpy(h_S, d_S, sizeof(float) * N * numMatrices, cudaMemcpyDeviceToHost));
#ifdef FULLSVD
gpuErrchk(cudaMemcpy(h_U, d_U, sizeof(cuComplex) * lda * M * numMatrices, cudaMemcpyDeviceToHost));
gpuErrchk(cudaMemcpy(h_V, d_V, sizeof(cuComplex) * N * N * numMatrices, cudaMemcpyDeviceToHost));
#endif
#ifdef PRINTRESULTS
printf("SINGULAR VALUES \n");
printf("_______________ \n");
for (int k = 0; k < numMatrices; k++)
{
for (int p = 0; p < N; p++)
printf("Matrix nr. %d; SV nr. %d; Value = %f\n", k, p, h_S[k * N + p]);
printf("\n");
}
#if 0 //FULLSVD
printf("SINGULAR VECTORS U \n");
printf("__________________ \n");
for (int k = 0; k < numMatrices; k++)
{
for (int q = 0; q < (1 - econ) * M + econ * min(M, N); q++)
for (int p = 0; p < M; p++)
printf("Matrix nr. %d; U nr. %d; Value = %f\n", k, p, h_U[((1 - econ) * M + econ * min(M, N)) * M * k + q * M + p]);
printf("\n");
}
printf("SINGULAR VECTORS V \n");
printf("__________________ \n");
for (int k = 0; k < numMatrices; k++)
{
for (int q = 0; q < (1 - econ) * N + econ * min(M, N); q++)
for (int p = 0; p < N; p++)
printf("Matrix nr. %d; V nr. %d; Value = %f\n", k, p, h_V[((1 - econ) * N + econ * min(M, N)) * N * k + q * N + p]);
printf("\n");
}
#endif
#endif
if (0 == devInfo_h)
{
printf("gesvdj converges \n");
}
else if (0 > devInfo_h)
{
printf("%d-th parameter is wrong \n", -devInfo_h);
exit(1);
}
else
{
printf("WARNING: devInfo_h = %d : gesvdj does not converge \n", devInfo_h);
}
// --- Free resources
if (d_A) gpuErrchk(cudaFree(d_A));
if (d_S) gpuErrchk(cudaFree(d_S));
#ifdef FULLSVD
if (d_U) gpuErrchk(cudaFree(d_U));
if (d_V) gpuErrchk(cudaFree(d_V));
#endif
if (devInfo) gpuErrchk(cudaFree(devInfo));
if (d_work) gpuErrchk(cudaFree(d_work));
if (solver_handle) cusolveSafeCall(cusolverDnDestroy(solver_handle));
if (gesvdj_params) cusolveSafeCall(cusolverDnDestroyGesvdjInfo(gesvdj_params));
gpuErrchk(cudaDeviceReset());
return 0;
}
TimingCPU.cpp
/* TIMING CPU */
/**************/
#include "TimingCPU.h"
#ifdef __linux__
#include <sys/time.h>
#include <stdio.h>
TimingCPU::TimingCPU() : cur_time_(0) {
StartCounter();
}
TimingCPU::~TimingCPU() { }
void TimingCPU::StartCounter()
{
struct timeval time;
if (gettimeofday(&time, 0)) return;
cur_time_ = 1000000 * time.tv_sec + time.tv_usec;
}
double TimingCPU::GetCounter()
{
struct timeval time;
if (gettimeofday(&time, 0)) return -1;
long cur_time = 1000000 * time.tv_sec + time.tv_usec;
double sec = (cur_time - cur_time_) / 1000000.0;
if (sec < 0) sec += 86400;
cur_time_ = cur_time;
return 1000. * sec;
}
#elif _WIN32 || _WIN64
#include <windows.h>
#include <iostream>
struct PrivateTimingCPU {
double PCFreq;
__int64 CounterStart;
};
// --- Default constructor
TimingCPU::TimingCPU() {
privateTimingCPU = new PrivateTimingCPU; (*privateTimingCPU).PCFreq = 0.0; (*privateTimingCPU).CounterStart = 0;
}
// --- Default destructor
TimingCPU::~TimingCPU() { }
// --- Starts the timing
void TimingCPU::StartCounter()
{
LARGE_INTEGER li;
if (!QueryPerformanceFrequency(&li)) std::cout << "QueryPerformanceFrequency failed!\n";
(*privateTimingCPU).PCFreq = double(li.QuadPart) / 1000.0;
QueryPerformanceCounter(&li);
(*privateTimingCPU).CounterStart = li.QuadPart;
}
// --- Gets the timing counter in ms
double TimingCPU::GetCounter()
{
LARGE_INTEGER li;
QueryPerformanceCounter(&li);
return double(li.QuadPart - (*privateTimingCPU).CounterStart) / (*privateTimingCPU).PCFreq;
}
#endif
TimingCPU.h
// 1 micro-second accuracy
// Returns the time in seconds
#ifndef __TIMINGCPU_H__
#define __TIMINGCPU_H__
#ifdef __linux__
class TimingCPU {
private:
long cur_time_;
public:
TimingCPU();
~TimingCPU();
void StartCounter();
double GetCounter();
};
#elif _WIN32 || _WIN64
struct PrivateTimingCPU;
class TimingCPU
{
private:
PrivateTimingCPU *privateTimingCPU;
public:
TimingCPU();
~TimingCPU();
void StartCounter();
double GetCounter();
}; // TimingCPU class
#endif
#endif
TimingGPU.cu
/**************/
/* TIMING GPU */
/**************/
#include "TimingGPU.cuh"
#include <cuda.h>
#include <cuda_runtime.h>
struct PrivateTimingGPU {
cudaEvent_t start;
cudaEvent_t stop;
};
// default constructor
TimingGPU::TimingGPU() {
privateTimingGPU = new PrivateTimingGPU;
}
// default destructor
TimingGPU::~TimingGPU() { }
void TimingGPU::StartCounter()
{
cudaEventCreate(&((*privateTimingGPU).start));
cudaEventCreate(&((*privateTimingGPU).stop));
cudaEventRecord((*privateTimingGPU).start, 0);
}
void TimingGPU::StartCounterFlags()
{
int eventflags = cudaEventBlockingSync;
cudaEventCreateWithFlags(&((*privateTimingGPU).start), eventflags);
cudaEventCreateWithFlags(&((*privateTimingGPU).stop), eventflags);
cudaEventRecord((*privateTimingGPU).start, 0);
}
// Gets the counter in ms
float TimingGPU::GetCounter()
{
float time;
cudaEventRecord((*privateTimingGPU).stop, 0);
cudaEventSynchronize((*privateTimingGPU).stop);
cudaEventElapsedTime(&time, (*privateTimingGPU).start, (*privateTimingGPU).stop);
return time;
}
TimingGPU.cuh
#ifndef __TIMING_CUH__
#define __TIMING_CUH__
/**************/
/* TIMING GPU */
/**************/
// Events are a part of CUDA API and provide a system independent way to measure execution times on CUDA devices with approximately 0.5
// microsecond precision.
struct PrivateTimingGPU;
class TimingGPU
{
private:
PrivateTimingGPU *privateTimingGPU;
public:
TimingGPU();
~TimingGPU();
void StartCounter();
void StartCounterFlags();
float GetCounter();
}; // TimingCPU class
#endif
Utilities.cu
#include <assert.h>
#include "cuda_runtime.h"
#include <cuda.h>
#include <cusolverDn.h>
/*******************/
/* iDivUp FUNCTION */
/*******************/
extern "C" int iDivUp(int a, int b) {
return ((a % b) != 0) ? (a / b + 1) : (a / b);
}
/********************/
/* CUDA ERROR CHECK */
/********************/
// --- Credit to http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api
void gpuAssert(cudaError_t code, char *file, int line, bool abort = true)
{
if (code != cudaSuccess)
{
fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) {
exit(code);
}
}
}
extern "C" void gpuErrchk(cudaError_t ans) {
gpuAssert((ans), __FILE__, __LINE__);
}
/**************************/
/* CUSOLVE ERROR CHECKING */
/**************************/
static const char *_cudaGetErrorEnum(cusolverStatus_t error)
{
switch (error)
{
case CUSOLVER_STATUS_SUCCESS:
return "CUSOLVER_SUCCESS";
case CUSOLVER_STATUS_NOT_INITIALIZED:
return "CUSOLVER_STATUS_NOT_INITIALIZED";
case CUSOLVER_STATUS_ALLOC_FAILED:
return "CUSOLVER_STATUS_ALLOC_FAILED";
case CUSOLVER_STATUS_INVALID_VALUE:
return "CUSOLVER_STATUS_INVALID_VALUE";
case CUSOLVER_STATUS_ARCH_MISMATCH:
return "CUSOLVER_STATUS_ARCH_MISMATCH";
case CUSOLVER_STATUS_EXECUTION_FAILED:
return "CUSOLVER_STATUS_EXECUTION_FAILED";
case CUSOLVER_STATUS_INTERNAL_ERROR:
return "CUSOLVER_STATUS_INTERNAL_ERROR";
case CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED:
return "CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED";
}
return "<unknown>";
}
inline void __cusolveSafeCall(cusolverStatus_t err, const char *file, const int line)
{
if (CUSOLVER_STATUS_SUCCESS != err) {
fprintf(stderr, "CUSOLVE error in file '%s', line %d\n %s\nerror %d: %s\nterminating!\n", __FILE__, __LINE__, err, \
_cudaGetErrorEnum(err)); \
cudaDeviceReset(); assert(0); \
}
}
extern "C" void cusolveSafeCall(cusolverStatus_t err) {
__cusolveSafeCall(err, __FILE__, __LINE__);
}
Utilities.cuh
#ifndef UTILITIES_CUH
#define UTILITIES_CUH
extern "C" int iDivUp(int, int);
extern "C" void gpuErrchk(cudaError_t);
extern "C" void cusolveSafeCall(cusolverStatus_t);
#ifndef DEVICE_RESET
#define DEVICE_RESET cudaDeviceReset();
#endif
template< typename T >
void check(T result, char const *const func, const char *const file, int const line)
{
if (result)
{
fprintf(stderr, "CUDA error at %s:%d code=%d(%s) \"%s\" \n",
file, line);
//fprintf(stderr, "CUDA error at %s:%d code=%d(%s) \"%s\" \n",
// file, line, static_cast<unsigned int>(result), _cudaGetErrorEnum(result), func);
DEVICE_RESET
// Make sure we call CUDA Device Reset before exiting
exit(EXIT_FAILURE);
}
}
// This will output the proper CUDA error strings in the event that a CUDA host call returns an error
#define checkCudaErrors(val) check ( (val), #val, __FILE__, __LINE__ )
// This will output the proper error string when calling cudaGetLastError
#define getLastCudaError(msg) __getLastCudaError (msg, __FILE__, __LINE__)
#ifndef MAX
#define MAX(a,b) (a > b ? a : b)
#endif
#endif
Could anyone suggest fixes for the errors I am getting in svd function and the errors after that.

Referring to the documentation, for the info parameter:
info device output an integer array of dimension batchSize
So this is expected to be an array of integers of size equal to the number of matrices in the batch. This makes sense because we expect one of these info reports for each matrix. But your allocation does not do that:
int *devInfo; gpuErrchk(cudaMalloc(&devInfo, sizeof(int)));
When I fix that:
int *devInfo; gpuErrchk(cudaMalloc(&devInfo, sizeof(int) * numMatrices));
the error goes away for me. This also has implications for your host-side allocation and also the copying of this data from device to host, later in the code.

Related

cuda-memcheck error in cusolverDnCgesvdjBatched function using CUDA

I am using cusolverDnCgesvdjBatched function to calculate singular value decomposition (SVD) of multiple matrices, I use cuda-memcheck to check any memory issues, I am getting an error like this in the cusolverDnCgesvdjBatched function.
========= Invalid __global__ write of size 4
========= at 0x000062f8 in void batched_svd_parallel_jacobi_32x16<float2, float>(int, int, int, int, float2*, __int64, int, float*, float2*, __int64, int, float2*, __int64, int, float, int, int*, float, int, int*, int, float)
========= by thread (0,0,0) in block (4,0,0)
========= Address 0x701019010 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host
========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to "unspecified launch failure" on CUDA API call to cuModuleUnload.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nvami.inf_amd64_72390dc4652f28fa\nvcuda64.dll (cuProfilerStop + 0x904ce) [0x2ae05e]
========= Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nvami.inf_amd64_72390dc4652f28fa\nvcuda64.dll (cuProfilerStop + 0x92e73) [0x2b0a03]
========= Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nvami.inf_amd64_72390dc4652f28fa\nvcuda64.dll [0x84cb7]
========= Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nvami.inf_amd64_72390dc4652f28fa\nvcuda64.dll [0x86e03]
========= Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nvami.inf_amd64_72390dc4652f28fa\nvcuda64.dll (cuProfilerStop + 0x11473a) [0x3322ca]
========= Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nvami.inf_amd64_72390dc4652f28fa\nvcuda64.dll (cuModuleUnload + 0x1d6) [0x1d5d36]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (cudart::module::unload + 0x115) [0x9535]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (cudart::contextState::unloadAllModules + 0x196) [0x9b36]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (cudart::contextStateManager::destroyAllContextStatesOnRuntimeUnload + 0x78) [0xa188]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (cudart::globalState::~globalState + 0x3d) [0x24dd]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (cudart::set<cudart::globalModule * __ptr64>::rehash + 0x106) [0x74c6]
========= Host Frame:C:\WINDOWS\System32\ucrtbase.dll (execute_onexit_table + 0x156) [0x142d6]
========= Host Frame:C:\WINDOWS\System32\ucrtbase.dll (execute_onexit_table + 0x7b) [0x141fb]
========= Host Frame:C:\WINDOWS\System32\ucrtbase.dll (execute_onexit_table + 0x34) [0x141b4]
========= Host Frame:C:\WINDOWS\System32\ucrtbase.dll (exit + 0x142) [0x20522]
========= Host Frame:C:\WINDOWS\System32\ucrtbase.dll (exit + 0xcb) [0x204ab]
========= Host Frame:C:\WINDOWS\System32\ucrtbase.dll (exit + 0x6e) [0x2044e]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (gpuErrchk + 0x4c) [0xf0dc]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (main + 0x3ef) [0xebaf]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (__scrt_common_main_seh + 0x10c) [0xf5c4]
========= Host Frame:C:\WINDOWS\System32\KERNEL32.dll (BaseThreadInitThunk + 0x14) [0x17034]
========= Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x52651]
=========
========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to "unspecified launch failure" on CUDA API call to cuModuleUnload.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nvami.inf_amd64_72390dc4652f28fa\nvcuda64.dll (cuProfilerStop + 0x904ce) [0x2ae05e]
========= Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nvami.inf_amd64_72390dc4652f28fa\nvcuda64.dll (cuProfilerStop + 0x92e73) [0x2b0a03]
========= Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nvami.inf_amd64_72390dc4652f28fa\nvcuda64.dll [0x84cb7]
========= Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nvami.inf_amd64_72390dc4652f28fa\nvcuda64.dll [0x86e03]
========= Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nvami.inf_amd64_72390dc4652f28fa\nvcuda64.dll (cuProfilerStop + 0x11473a) [0x3322ca]
========= Host Frame:C:\WINDOWS\system32\DriverStore\FileRepository\nvami.inf_amd64_72390dc4652f28fa\nvcuda64.dll (cuModuleUnload + 0x1d6) [0x1d5d36]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (cudart::module::unload + 0x115) [0x9535]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (cudart::contextState::unloadAllModules + 0x196) [0x9b36]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (cudart::contextStateManager::destroyAllContextStatesOnRuntimeUnload + 0x78) [0xa188]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (cudart::globalState::~globalState + 0x3d) [0x24dd]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (cudart::set<cudart::globalModule * __ptr64>::rehash + 0x106) [0x74c6]
========= Host Frame:C:\WINDOWS\System32\ucrtbase.dll (execute_onexit_table + 0x156) [0x142d6]
========= Host Frame:C:\WINDOWS\System32\ucrtbase.dll (execute_onexit_table + 0x7b) [0x141fb]
========= Host Frame:C:\WINDOWS\System32\ucrtbase.dll (execute_onexit_table + 0x34) [0x141b4]
========= Host Frame:C:\WINDOWS\System32\ucrtbase.dll (exit + 0x142) [0x20522]
========= Host Frame:C:\WINDOWS\System32\ucrtbase.dll (exit + 0xcb) [0x204ab]
========= Host Frame:C:\WINDOWS\System32\ucrtbase.dll (exit + 0x6e) [0x2044e]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (gpuErrchk + 0x4c) [0xf0dc]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (main + 0x3ef) [0xebaf]
========= Host Frame:D:\SVD\x64\Release\SVD.exe (__scrt_common_main_seh + 0x10c) [0xf5c4]
========= Host Frame:C:\WINDOWS\System32\KERNEL32.dll (BaseThreadInitThunk + 0x14) [0x17034]
========= Host Frame:C:\WINDOWS\SYSTEM32\ntdll.dll (RtlUserThreadStart + 0x21) [0x52651]
=========
========= ERROR SUMMARY: 8 errors
I am attaching the whole code I am using.
kernel.cu
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>
#include <cuda_runtime.h>
#include <cusolverDn.h>
#include "Utilities.cuh"
#include "TimingGPU.cuh"
#define FULLSVD
#define PRINTRESULTS
/********/
/* MAIN */
/********/
int main() {
const int M = 10;
const int N = 5;
const int lda = M;
//const int numMatrices = 3;
const int numMatrices = 256;
TimingGPU timerGPU;
// --- Setting the host matrix
cuComplex *h_A = (cuComplex *)malloc(lda * N * numMatrices * sizeof(double));
for (unsigned int k = 0; k < numMatrices; k++)
for (unsigned int i = 0; i < M; i++)
{
for (unsigned int j = 0; j < N; j++)
{
h_A[k * M * N + j * M + i] = make_float2((1. / (k + 1)) * (i + j * j) * (i + j), (1. / (k + 1)) * (i + j * j) * (i + j));
//printf("[%d, %d] %f\n", i, j, h_A[j*M + i]);
//printf("%f %f", h_A[j*M + i].x, h_A[j * M + i].y);
}
//printf("\n");
}
// --- Setting the device matrix and moving the host matrix to the device
cuComplex *d_A; gpuErrchk(cudaMalloc(&d_A, M * N * numMatrices * sizeof(cuComplex)));
gpuErrchk(cudaMemcpy(d_A, h_A, M * N * numMatrices * sizeof(cuComplex), cudaMemcpyHostToDevice));
// --- host side SVD results space
float *h_S = (float *)malloc(N * numMatrices * sizeof(float));
cuComplex *h_U = NULL;
cuComplex *h_V = NULL;
#ifdef FULLSVD
h_U = (cuComplex *)malloc(M * M * numMatrices * sizeof(cuComplex));
h_V = (cuComplex *)malloc(N * N * numMatrices * sizeof(cuComplex));
#endif
// --- device side SVD workspace and matrices
int work_size = 0;
int *devInfo; gpuErrchk(cudaMalloc(&devInfo, sizeof(int)));
float *d_S; gpuErrchk(cudaMalloc(&d_S, N * numMatrices * sizeof(float)));
cuComplex *d_U = NULL;
cuComplex *d_V = NULL;
#ifdef FULLSVD
gpuErrchk(cudaMalloc(&d_U, M * M * numMatrices * sizeof(cuComplex)));
gpuErrchk(cudaMalloc(&d_V, N * N * numMatrices * sizeof(cuComplex)));
#endif
cuComplex *d_work = NULL; /* devie workspace for gesvdj */
int devInfo_h = 0; /* host copy of error devInfo_h */
// --- Parameters configuration of Jacobi-based SVD
const double tol = 1.e-7;
const int maxSweeps = 15;
cusolverEigMode_t jobz; // --- CUSOLVER_EIG_MODE_VECTOR - Compute eigenvectors; CUSOLVER_EIG_MODE_NOVECTOR - Compute singular values only
#ifdef FULLSVD
jobz = CUSOLVER_EIG_MODE_VECTOR;
#else
jobz = CUSOLVER_EIG_MODE_NOVECTOR;
#endif
const int econ = 0; // --- econ = 1 for economy size
// --- Numerical result parameters of gesvdj
double residual = 0;
int executedSweeps = 0;
// --- CUDA solver initialization
cusolverDnHandle_t solver_handle = NULL;
cusolveSafeCall(cusolverDnCreate(&solver_handle));
// --- Configuration of gesvdj
gesvdjInfo_t gesvdj_params = NULL;
cusolveSafeCall(cusolverDnCreateGesvdjInfo(&gesvdj_params));
// --- Set the computation tolerance, since the default tolerance is machine precision
cusolveSafeCall(cusolverDnXgesvdjSetTolerance(gesvdj_params, tol));
// --- Set the maximum number of sweeps, since the default value of max. sweeps is 100
cusolveSafeCall(cusolverDnXgesvdjSetMaxSweeps(gesvdj_params, maxSweeps));
// --- Query the SVD workspace
cusolveSafeCall(cusolverDnCgesvdjBatched_bufferSize(
solver_handle,
jobz, // --- Compute the singular vectors or not
M, // --- Number of rows of A, 0 <= M
N, // --- Number of columns of A, 0 <= N
d_A, // --- M x N
lda, // --- Leading dimension of A
d_S, // --- Square matrix of size min(M, N) x min(M, N)
d_U, // --- M x M if econ = 0, M x min(M, N) if econ = 1
lda, // --- Leading dimension of U, ldu >= max(1, M)
d_V, // --- N x N if econ = 0, N x min(M,N) if econ = 1
lda, // --- Leading dimension of V, ldv >= max(1, N)
&work_size,
gesvdj_params,
numMatrices));
gpuErrchk(cudaMalloc(&d_work, sizeof(cuComplex) * work_size));
// --- Compute SVD
timerGPU.StartCounter();
cusolveSafeCall(cusolverDnCgesvdjBatched(
solver_handle,
jobz, // --- Compute the singular vectors or not
M, // --- Number of rows of A, 0 <= M
N, // --- Number of columns of A, 0 <= N
d_A, // --- M x N
lda, // --- Leading dimension of A
d_S, // --- Square matrix of size min(M, N) x min(M, N)
d_U, // --- M x M if econ = 0, M x min(M, N) if econ = 1
lda, // --- Leading dimension of U, ldu >= max(1, M)
d_V, // --- N x N if econ = 0, N x min(M, N) if econ = 1
N, // --- Leading dimension of V, ldv >= max(1, N)
d_work,
work_size,
devInfo,
gesvdj_params,
numMatrices));
printf("Calculation of the singular values only: %f ms\n\n", timerGPU.GetCounter());
gpuErrchk(cudaMemcpy(&devInfo_h, devInfo, sizeof(int), cudaMemcpyDeviceToHost));
gpuErrchk(cudaMemcpy(h_S, d_S, sizeof(float) * N * numMatrices, cudaMemcpyDeviceToHost));
#ifdef FULLSVD
gpuErrchk(cudaMemcpy(h_U, d_U, sizeof(cuComplex) * lda * M * numMatrices, cudaMemcpyDeviceToHost));
gpuErrchk(cudaMemcpy(h_V, d_V, sizeof(cuComplex) * N * N * numMatrices, cudaMemcpyDeviceToHost));
#endif
#ifdef PRINTRESULTS
printf("SINGULAR VALUES \n");
printf("_______________ \n");
for (int k = 0; k < numMatrices; k++)
{
for (int p = 0; p < N; p++)
printf("Matrix nr. %d; SV nr. %d; Value = %f\n", k, p, h_S[k * N + p]);
printf("\n");
}
#if 0 //FULLSVD
printf("SINGULAR VECTORS U \n");
printf("__________________ \n");
for (int k = 0; k < numMatrices; k++)
{
for (int q = 0; q < (1 - econ) * M + econ * min(M, N); q++)
for (int p = 0; p < M; p++)
printf("Matrix nr. %d; U nr. %d; Value = %f\n", k, p, h_U[((1 - econ) * M + econ * min(M, N)) * M * k + q * M + p]);
printf("\n");
}
printf("SINGULAR VECTORS V \n");
printf("__________________ \n");
for (int k = 0; k < numMatrices; k++)
{
for (int q = 0; q < (1 - econ) * N + econ * min(M, N); q++)
for (int p = 0; p < N; p++)
printf("Matrix nr. %d; V nr. %d; Value = %f\n", k, p, h_V[((1 - econ) * N + econ * min(M, N)) * N * k + q * N + p]);
printf("\n");
}
#endif
#endif
if (0 == devInfo_h)
{
printf("gesvdj converges \n");
}
else if (0 > devInfo_h)
{
printf("%d-th parameter is wrong \n", -devInfo_h);
exit(1);
}
else
{
printf("WARNING: devInfo_h = %d : gesvdj does not converge \n", devInfo_h);
}
// --- Free resources
if (d_A) gpuErrchk(cudaFree(d_A));
if (d_S) gpuErrchk(cudaFree(d_S));
#ifdef FULLSVD
if (d_U) gpuErrchk(cudaFree(d_U));
if (d_V) gpuErrchk(cudaFree(d_V));
#endif
if (devInfo) gpuErrchk(cudaFree(devInfo));
if (d_work) gpuErrchk(cudaFree(d_work));
if (solver_handle) cusolveSafeCall(cusolverDnDestroy(solver_handle));
if (gesvdj_params) cusolveSafeCall(cusolverDnDestroyGesvdjInfo(gesvdj_params));
gpuErrchk(cudaDeviceReset());
return 0;
}
TimingCPU.cpp
/* TIMING CPU */
/**************/
#include "TimingCPU.h"
#ifdef __linux__
#include <sys/time.h>
#include <stdio.h>
TimingCPU::TimingCPU() : cur_time_(0) {
StartCounter();
}
TimingCPU::~TimingCPU() { }
void TimingCPU::StartCounter()
{
struct timeval time;
if (gettimeofday(&time, 0)) return;
cur_time_ = 1000000 * time.tv_sec + time.tv_usec;
}
double TimingCPU::GetCounter()
{
struct timeval time;
if (gettimeofday(&time, 0)) return -1;
long cur_time = 1000000 * time.tv_sec + time.tv_usec;
double sec = (cur_time - cur_time_) / 1000000.0;
if (sec < 0) sec += 86400;
cur_time_ = cur_time;
return 1000. * sec;
}
#elif _WIN32 || _WIN64
#include <windows.h>
#include <iostream>
struct PrivateTimingCPU {
double PCFreq;
__int64 CounterStart;
};
// --- Default constructor
TimingCPU::TimingCPU() {
privateTimingCPU = new PrivateTimingCPU; (*privateTimingCPU).PCFreq = 0.0; (*privateTimingCPU).CounterStart = 0;
}
// --- Default destructor
TimingCPU::~TimingCPU() { }
// --- Starts the timing
void TimingCPU::StartCounter()
{
LARGE_INTEGER li;
if (!QueryPerformanceFrequency(&li)) std::cout << "QueryPerformanceFrequency failed!\n";
(*privateTimingCPU).PCFreq = double(li.QuadPart) / 1000.0;
QueryPerformanceCounter(&li);
(*privateTimingCPU).CounterStart = li.QuadPart;
}
// --- Gets the timing counter in ms
double TimingCPU::GetCounter()
{
LARGE_INTEGER li;
QueryPerformanceCounter(&li);
return double(li.QuadPart - (*privateTimingCPU).CounterStart) / (*privateTimingCPU).PCFreq;
}
#endif
TimingCPU.h
// 1 micro-second accuracy
// Returns the time in seconds
#ifndef __TIMINGCPU_H__
#define __TIMINGCPU_H__
#ifdef __linux__
class TimingCPU {
private:
long cur_time_;
public:
TimingCPU();
~TimingCPU();
void StartCounter();
double GetCounter();
};
#elif _WIN32 || _WIN64
struct PrivateTimingCPU;
class TimingCPU
{
private:
PrivateTimingCPU *privateTimingCPU;
public:
TimingCPU();
~TimingCPU();
void StartCounter();
double GetCounter();
}; // TimingCPU class
#endif
#endif
TimingGPU.cu
/**************/
/* TIMING GPU */
/**************/
#include "TimingGPU.cuh"
#include <cuda.h>
#include <cuda_runtime.h>
struct PrivateTimingGPU {
cudaEvent_t start;
cudaEvent_t stop;
};
// default constructor
TimingGPU::TimingGPU() {
privateTimingGPU = new PrivateTimingGPU;
}
// default destructor
TimingGPU::~TimingGPU() { }
void TimingGPU::StartCounter()
{
cudaEventCreate(&((*privateTimingGPU).start));
cudaEventCreate(&((*privateTimingGPU).stop));
cudaEventRecord((*privateTimingGPU).start, 0);
}
void TimingGPU::StartCounterFlags()
{
int eventflags = cudaEventBlockingSync;
cudaEventCreateWithFlags(&((*privateTimingGPU).start), eventflags);
cudaEventCreateWithFlags(&((*privateTimingGPU).stop), eventflags);
cudaEventRecord((*privateTimingGPU).start, 0);
}
// Gets the counter in ms
float TimingGPU::GetCounter()
{
float time;
cudaEventRecord((*privateTimingGPU).stop, 0);
cudaEventSynchronize((*privateTimingGPU).stop);
cudaEventElapsedTime(&time, (*privateTimingGPU).start, (*privateTimingGPU).stop);
return time;
}
TimingGPU.cuh
#ifndef __TIMING_CUH__
#define __TIMING_CUH__
/**************/
/* TIMING GPU */
/**************/
// Events are a part of CUDA API and provide a system independent way to measure execution times on CUDA devices with approximately 0.5
// microsecond precision.
struct PrivateTimingGPU;
class TimingGPU
{
private:
PrivateTimingGPU *privateTimingGPU;
public:
TimingGPU();
~TimingGPU();
void StartCounter();
void StartCounterFlags();
float GetCounter();
}; // TimingCPU class
#endif
Utilities.cu
#include <assert.h>
#include "cuda_runtime.h"
#include <cuda.h>
#include <cusolverDn.h>
/*******************/
/* iDivUp FUNCTION */
/*******************/
extern "C" int iDivUp(int a, int b) {
return ((a % b) != 0) ? (a / b + 1) : (a / b);
}
/********************/
/* CUDA ERROR CHECK */
/********************/
// --- Credit to http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api
void gpuAssert(cudaError_t code, char *file, int line, bool abort = true)
{
if (code != cudaSuccess)
{
fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) {
exit(code);
}
}
}
extern "C" void gpuErrchk(cudaError_t ans) {
gpuAssert((ans), __FILE__, __LINE__);
}
/**************************/
/* CUSOLVE ERROR CHECKING */
/**************************/
static const char *_cudaGetErrorEnum(cusolverStatus_t error)
{
switch (error)
{
case CUSOLVER_STATUS_SUCCESS:
return "CUSOLVER_SUCCESS";
case CUSOLVER_STATUS_NOT_INITIALIZED:
return "CUSOLVER_STATUS_NOT_INITIALIZED";
case CUSOLVER_STATUS_ALLOC_FAILED:
return "CUSOLVER_STATUS_ALLOC_FAILED";
case CUSOLVER_STATUS_INVALID_VALUE:
return "CUSOLVER_STATUS_INVALID_VALUE";
case CUSOLVER_STATUS_ARCH_MISMATCH:
return "CUSOLVER_STATUS_ARCH_MISMATCH";
case CUSOLVER_STATUS_EXECUTION_FAILED:
return "CUSOLVER_STATUS_EXECUTION_FAILED";
case CUSOLVER_STATUS_INTERNAL_ERROR:
return "CUSOLVER_STATUS_INTERNAL_ERROR";
case CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED:
return "CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED";
}
return "<unknown>";
}
inline void __cusolveSafeCall(cusolverStatus_t err, const char *file, const int line)
{
if (CUSOLVER_STATUS_SUCCESS != err) {
fprintf(stderr, "CUSOLVE error in file '%s', line %d\n %s\nerror %d: %s\nterminating!\n", __FILE__, __LINE__, err, \
_cudaGetErrorEnum(err)); \
cudaDeviceReset(); assert(0); \
}
}
extern "C" void cusolveSafeCall(cusolverStatus_t err) {
__cusolveSafeCall(err, __FILE__, __LINE__);
}
Utilities.cuh
#ifndef UTILITIES_CUH
#define UTILITIES_CUH
extern "C" int iDivUp(int, int);
extern "C" void gpuErrchk(cudaError_t);
extern "C" void cusolveSafeCall(cusolverStatus_t);
#ifndef DEVICE_RESET
#define DEVICE_RESET cudaDeviceReset();
#endif
template< typename T >
void check(T result, char const *const func, const char *const file, int const line)
{
if (result)
{
fprintf(stderr, "CUDA error at %s:%d code=%d(%s) \"%s\" \n",
file, line);
//fprintf(stderr, "CUDA error at %s:%d code=%d(%s) \"%s\" \n",
// file, line, static_cast<unsigned int>(result), _cudaGetErrorEnum(result), func);
DEVICE_RESET
// Make sure we call CUDA Device Reset before exiting
exit(EXIT_FAILURE);
}
}
// This will output the proper CUDA error strings in the event that a CUDA host call returns an error
#define checkCudaErrors(val) check ( (val), #val, __FILE__, __LINE__ )
// This will output the proper error string when calling cudaGetLastError
#define getLastCudaError(msg) __getLastCudaError (msg, __FILE__, __LINE__)
#ifndef MAX
#define MAX(a,b) (a > b ? a : b)
#endif
#endif
Could anyone suggest fixes for the errors I am getting in svd function and the errors after that.
Referring to the documentation, for the info parameter:
info device output an integer array of dimension batchSize
So this is expected to be an array of integers of size equal to the number of matrices in the batch. This makes sense because we expect one of these info reports for each matrix. But your allocation does not do that:
int *devInfo; gpuErrchk(cudaMalloc(&devInfo, sizeof(int)));
When I fix that:
int *devInfo; gpuErrchk(cudaMalloc(&devInfo, sizeof(int) * numMatrices));
the error goes away for me. This also has implications for your host-side allocation and also the copying of this data from device to host, later in the code.

CUDA matrixMulCUBLAS extra memcpy call

While playing with CUBLAS matrix multiplication sample I realised that nvprof profiler shows an extra call of cudaMemcpy Host to Device.
While 2 appear in source code, 3 actual calls are issued.
Why would that be? Is it an intrinsic effect of using CUBLAS?
Code from CUDA CUBLAS sample:
compiled with flags: -lcublas -I/usr/local/cuda-7.5/samples/common/inc
//////////////////////////////////////////////////////////////////////////
// Utilities and system includes
#include <assert.h>
#include <helper_string.h> // helper for shared functions common to CUDA Samples
// CUDA runtime
#include <cuda_runtime.h>
#include <cublas_v2.h>
// CUDA and CUBLAS functions
#include <helper_functions.h>
#include <helper_cuda.h>
#ifndef min
#define min(a,b) ((a < b) ? a : b)
#endif
#ifndef max
#define max(a,b) ((a > b) ? a : b)
#endif
typedef struct _matrixSize // Optional Command-line multiplier for matrix sizes
{
unsigned int uiWA, uiHA, uiWB, uiHB, uiWC, uiHC;
} sMatrixSize;
////////////////////////////////////////////////////////////////////////////////
//! Compute reference data set matrix multiply on CPU
//! C = A * B
//! #param C reference data, computed but preallocated
//! #param A matrix A as provided to device
//! #param B matrix B as provided to device
//! #param hA height of matrix A
//! #param wB width of matrix B
////////////////////////////////////////////////////////////////////////////////
void
matrixMulCPU(float *C, const float *A, const float *B, unsigned int hA, unsigned int wA, unsigned int wB)
{
for (unsigned int i = 0; i < hA; ++i)
for (unsigned int j = 0; j < wB; ++j)
{
double sum = 0;
for (unsigned int k = 0; k < wA; ++k)
{
double a = A[i * wA + k];
double b = B[k * wB + j];
sum += a * b;
}
C[i * wB + j] = (float)sum;
}
}
// Allocates a matrix with random float entries.
void randomInit(float *data, int size)
{
for (int i = 0; i < size; ++i)
data[i] = rand() / (float)RAND_MAX;
}
void printDiff(float *data1, float *data2, int width, int height, int iListLength, float fListTol)
{
printf("Listing first %d Differences > %.6f...\n", iListLength, fListTol);
int i,j,k;
int error_count=0;
for (j = 0; j < height; j++)
{
if (error_count < iListLength)
{
printf("\n Row %d:\n", j);
}
for (i = 0; i < width; i++)
{
k = j * width + i;
float fDiff = fabs(data1[k] - data2[k]);
if (fDiff > fListTol)
{
if (error_count < iListLength)
{
printf(" Loc(%d,%d)\tCPU=%.5f\tGPU=%.5f\tDiff=%.6f\n", i, j, data1[k], data2[k], fDiff);
}
error_count++;
}
}
}
printf(" \n Total Errors = %d\n", error_count);
}
void initializeCUDA(int argc, char **argv, int &devID, int &iSizeMultiple, sMatrixSize &matrix_size)
{
// By default, we use device 0, otherwise we override the device ID based on what is provided at the command line
cudaError_t error;
devID = 0;
if (checkCmdLineFlag(argc, (const char **)argv, "device"))
{
devID = getCmdLineArgumentInt(argc, (const char **)argv, "device");
error = cudaSetDevice(devID);
if (error != cudaSuccess)
{
printf("cudaSetDevice returned error code %d, line(%d)\n", error, __LINE__);
exit(EXIT_FAILURE);
}
}
// get number of SMs on this GPU
error = cudaGetDevice(&devID);
if (error != cudaSuccess)
{
printf("cudaGetDevice returned error code %d, line(%d)\n", error, __LINE__);
exit(EXIT_FAILURE);
}
if (checkCmdLineFlag(argc, (const char **)argv, "sizemult"))
{
iSizeMultiple = getCmdLineArgumentInt(argc, (const char **)argv, "sizemult");
}
iSizeMultiple = min(iSizeMultiple, 10);
iSizeMultiple = max(iSizeMultiple, 1);
cudaDeviceProp deviceProp;
error = cudaGetDeviceProperties(&deviceProp, devID);
if (error != cudaSuccess)
{
printf("cudaGetDeviceProperties returned error code %d, line(%d)\n", error, __LINE__);
exit(EXIT_FAILURE);
}
printf("GPU Device %d: \"%s\" with compute capability %d.%d\n\n", devID, deviceProp.name, deviceProp.major, deviceProp.minor);
// use a larger block size for Fermi and above
int block_size = (deviceProp.major < 2) ? 16 : 32;
matrix_size.uiWA = 3 * block_size * iSizeMultiple;
matrix_size.uiHA = 4 * block_size * iSizeMultiple;
matrix_size.uiWB = 2 * block_size * iSizeMultiple;
matrix_size.uiHB = 3 * block_size * iSizeMultiple;
matrix_size.uiWC = 2 * block_size * iSizeMultiple;
matrix_size.uiHC = 4 * block_size * iSizeMultiple;
printf("MatrixA(%u,%u), MatrixB(%u,%u), MatrixC(%u,%u)\n",
matrix_size.uiHA, matrix_size.uiWA,
matrix_size.uiHB, matrix_size.uiWB,
matrix_size.uiHC, matrix_size.uiWC);
if( matrix_size.uiWA != matrix_size.uiHB ||
matrix_size.uiHA != matrix_size.uiHC ||
matrix_size.uiWB != matrix_size.uiWC)
{
printf("ERROR: Matrix sizes do not match!\n");
exit(-1);
}
}
////////////////////////////////////////////////////////////////////////////////
//! Run a simple test matrix multiply using CUBLAS
////////////////////////////////////////////////////////////////////////////////
int matrixMultiply(int argc, char **argv, int devID, sMatrixSize &matrix_size)
{
cudaDeviceProp deviceProp;
checkCudaErrors(cudaGetDeviceProperties(&deviceProp, devID));
// use a larger block size for Fermi and above
int block_size = (deviceProp.major < 2) ? 16 : 32;
// set seed for rand()
srand(2006);
// allocate host memory for matrices A and B
unsigned int size_A = matrix_size.uiWA * matrix_size.uiHA;
unsigned int mem_size_A = sizeof(float) * size_A;
float *h_A = (float *)malloc(mem_size_A);
unsigned int size_B = matrix_size.uiWB * matrix_size.uiHB;
unsigned int mem_size_B = sizeof(float) * size_B;
float *h_B = (float *)malloc(mem_size_B);
// set seed for rand()
srand(2006);
// initialize host memory
randomInit(h_A, size_A);
randomInit(h_B, size_B);
// allocate device memory
float *d_A, *d_B, *d_C;
unsigned int size_C = matrix_size.uiWC * matrix_size.uiHC;
unsigned int mem_size_C = sizeof(float) * size_C;
// allocate host memory for the result
float *h_C = (float *) malloc(mem_size_C);
float *h_CUBLAS = (float *) malloc(mem_size_C);
checkCudaErrors(cudaMalloc((void **) &d_A, mem_size_A));
checkCudaErrors(cudaMalloc((void **) &d_B, mem_size_B));
checkCudaErrors(cudaMemcpy(d_A, h_A, mem_size_A, cudaMemcpyHostToDevice));
checkCudaErrors(cudaMemcpy(d_B, h_B, mem_size_B, cudaMemcpyHostToDevice));
checkCudaErrors(cudaMalloc((void **) &d_C, mem_size_C));
// setup execution parameters
dim3 threads(block_size, block_size);
dim3 grid(matrix_size.uiWC / threads.x, matrix_size.uiHC / threads.y);
// create and start timer
printf("Computing result using CUBLAS...");
// execute the kernel
int nIter = 30;
// CUBLAS version 2.0
{
const float alpha = 1.0f;
const float beta = 0.0f;
cublasHandle_t handle;
cudaEvent_t start, stop;
checkCudaErrors(cublasCreate(&handle));
//Perform warmup operation with cublas
checkCudaErrors(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, matrix_size.uiWB, matrix_size.uiHA, matrix_size.uiWA, &alpha, d_B, matrix_size.uiWB, d_A, matrix_size.uiWA, &beta, d_C, matrix_size.uiWB));
// Allocate CUDA events that we'll use for timing
checkCudaErrors(cudaEventCreate(&start));
checkCudaErrors(cudaEventCreate(&stop));
// Record the start event
checkCudaErrors(cudaEventRecord(start, NULL));
for (int j = 0; j < nIter; j++)
{
//note cublas is column primary!
//need to transpose the order
checkCudaErrors(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, matrix_size.uiWB, matrix_size.uiHA, matrix_size.uiWA, &alpha, d_B, matrix_size.uiWB, d_A, matrix_size.uiWA, &beta, d_C, matrix_size.uiWB));
}
printf("done.\n");
// Record the stop event
checkCudaErrors(cudaEventRecord(stop, NULL));
// Wait for the stop event to complete
checkCudaErrors(cudaEventSynchronize(stop));
float msecTotal = 0.0f;
checkCudaErrors(cudaEventElapsedTime(&msecTotal, start, stop));
// Compute and print the performance
float msecPerMatrixMul = msecTotal / nIter;
double flopsPerMatrixMul = 2.0 * (double)matrix_size.uiHC * (double)matrix_size.uiWC * (double)matrix_size.uiHB;
double gigaFlops = (flopsPerMatrixMul * 1.0e-9f) / (msecPerMatrixMul / 1000.0f);
printf(
"Performance= %.2f GFlop/s, Time= %.3f msec, Size= %.0f Ops\n",
gigaFlops,
msecPerMatrixMul,
flopsPerMatrixMul);
// copy result from device to host
checkCudaErrors(cudaMemcpy(h_CUBLAS, d_C, mem_size_C, cudaMemcpyDeviceToHost));
// Destroy the handle
checkCudaErrors(cublasDestroy(handle));
}
// compute reference solution
printf("Computing result using host CPU...");
float *reference = (float *)malloc(mem_size_C);
matrixMulCPU(reference, h_A, h_B, matrix_size.uiHA, matrix_size.uiWA, matrix_size.uiWB);
printf("done.\n");
// check result (CUBLAS)
bool resCUBLAS = sdkCompareL2fe(reference, h_CUBLAS, size_C, 1.0e-6f);
if (resCUBLAS != true)
{
printDiff(reference, h_CUBLAS, matrix_size.uiWC, matrix_size.uiHC, 100, 1.0e-5f);
}
printf("Comparing CUBLAS Matrix Multiply with CPU results: %s\n", (true == resCUBLAS) ? "PASS" : "FAIL");
printf("\nNOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.\n");
// clean up memory
free(h_A);
free(h_B);
free(h_C);
free(reference);
checkCudaErrors(cudaFree(d_A));
checkCudaErrors(cudaFree(d_B));
checkCudaErrors(cudaFree(d_C));
// cudaDeviceReset causes the driver to clean up all state. While
// not mandatory in normal operation, it is good practice. It is also
// needed to ensure correct operation when the application is being
// profiled. Calling cudaDeviceReset causes all profile data to be
// flushed before the application exits
cudaDeviceReset();
if (resCUBLAS == true)
{
return EXIT_SUCCESS; // return value = 1
}
else
{
return EXIT_FAILURE; // return value = 0
}
}
////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
printf("[Matrix Multiply CUBLAS] - Starting...\n");
int devID = 0, sizeMult = 5;
sMatrixSize matrix_size;
initializeCUDA(argc, argv, devID, sizeMult, matrix_size);
int matrix_result = matrixMultiply(argc, argv, devID, matrix_size);
return matrix_result;
}
The additional memory transfer seems to be caused by the CUBLAS library and is triggered by a call to cublasInit. You can confirm this by profiling the following code:
#include <cublas_v2.h>
int main()
{
cublasHandle_t handle;
cublasCreate(&handle);
cudaDeviceReset();
return 0;
}
which nvprof reports as calling cudaMemcpy:
$ nvprof ./a.out
==9536== NVPROF is profiling process 9536, command: ./a.out
==9536== Profiling application: ./a.out
==9536== Profiling result:
Time(%) Time Calls Avg Min Max Name
100.00% 1.1190us 1 1.1190us 1.1190us 1.1190us [CUDA memcpy HtoD]
==9536== API calls:
Time(%) Time Calls Avg Min Max Name
76.51% 348.53ms 1 348.53ms 348.53ms 348.53ms cudaFree
23.26% 105.97ms 1 105.97ms 105.97ms 105.97ms cudaDeviceReset
0.09% 420.25us 178 2.3600us 125ns 103.52us cuDeviceGetAttribute
0.08% 349.37us 2 174.69us 110.59us 238.78us cuDeviceTotalMem
0.04% 202.10us 3 67.366us 9.3750us 109.43us cudaMalloc
0.01% 55.217us 2 27.608us 24.529us 30.688us cuDeviceGetName
0.00% 14.365us 1 14.365us 14.365us 14.365us cudaMemcpy
0.00% 10.016us 16 626ns 434ns 2.0440us cudaEventCreateWithFlags
0.00% 4.5000us 11 409ns 271ns 1.2730us cudaDeviceGetAttribute
0.00% 3.4510us 4 862ns 251ns 2.3370us cuDeviceGetCount
0.00% 2.3200us 4 580ns 281ns 1.0350us cuDeviceGet
0.00% 1.3600us 1 1.3600us 1.3600us 1.3600us cudaGetDevice
0.00% 630ns 1 630ns 630ns 630ns cuInit
0.00% 339ns 1 339ns 339ns 339ns cuDriverGetVersion
I doubt that anyone without access to the current CUBLAS source will be able to explain why initialising the CUBLAS library triggers a host to device transfer, but that seems to be the cause of your observation.

Solving general sparse linear systems in CUDA

I am currently working on CUDA and trying to solve Ax = b using cuBLAS and cuSPARSE library. I looked through the sample codes including conjugateGradient & conjugateGradientPrecond provided by NVIDIA. However, the conjugate gradient method only works for positive definite matrix and it is an iterative method. Now, I have some general sparse matrices and I think I should take advantage of cuSPARSE library. Does anyone know how can I solve Ax = b using cuSPARSE and cuBLAS libraries? I could not find useful APIs for me. Generally, the matrices are expected to be at least 1000x1000 and in some cases it would go up to 100000x100000. Should I do this using a direct method?
One possibility to solve general sparse linear systems in CUDA is using cuSOLVER.
cuSOLVER has three useful routines:
cusolverSpDcsrlsvlu, which works for square linear systems (number of unknowns equal to the number of equations) and internally uses sparse LU factorization with partial pivoting;
cusolverSpDcsrlsvqr, which works for square linear systems (number of unknowns equal to the number of equations) and internally uses sparse QR factorization;
cusolverSpDcsrlsqvqr, which works for rectangular linear systems (number of unknowns different to the number of equations) and internally solves a least square problem.
For ALL the above routines, the supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. If A is symmetric/Hermitian and only lower/upper part is used or meaningful, then its missing upper/lower part must be extended.
NOTES ON cusolverSpDcsrlsvlu
Attention should be paid to two input parameters: tol and reorder. Concerning the former, if the system matrix A is singular, then some diagonal elements of the matrix U of the LU decomposition are zero. The algorithm decides for zero if |U(j,j)|<tol. Concerning the latter, cuSOLVER provides a reordering to reduce
zero fill-in which dramactically affects the performance of LU factorization. reorder toggles between reordering (reorder=1) or not reordering (reorder=0).
Attention should be paid also to an output parameter: singularity. It is -1 if A is invertible, otherwise it provides the first index j such that U(j,j)=0.
NOTES ON cusolverSpDcsrlsvqr
Attention should be paid to the same input/output parameters are before. In particular, tol is used to decide for singularity, reorder has no effect and singularity is -1 if A is invertible, otherwise it returns the first index j such that R(j,j)=0.
NOTES ON cusolverSpDcsrlsqvqr
Attention should be paid to the input parameter tol, which is used to decide the rank of A.
Attention should be also paid to the output parameters rankA, which represents the numerical rank of A, p, a permutation vector of length equal to the number of columns of A (please, see the documentation for further details) and min_norm, which is the norm of the residual ||Ax - b||.
Currently, as of CUDA 10.0, the above three functions are for the host channel only, which means that they do not yet run on GPU. They must be called as:
cusolverSpDcsrlsvluHost;
cusolverSpDcsrlsvqrHost;
cusolverSpDcsrlsqvqrHost,
and the input argument should all reside on the host.
Below, please find a fully worked example using all the above three possibilities:
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cusparse.h>
#include <cusolverSp.h>
/*******************/
/* iDivUp FUNCTION */
/*******************/
//extern "C" int iDivUp(int a, int b){ return ((a % b) != 0) ? (a / b + 1) : (a / b); }
__host__ __device__ int iDivUp(int a, int b){ return ((a % b) != 0) ? (a / b + 1) : (a / b); }
/********************/
/* CUDA ERROR CHECK */
/********************/
// --- Credit to http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api
void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
if (code != cudaSuccess)
{
fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) { exit(code); }
}
}
extern "C" void gpuErrchk(cudaError_t ans) { gpuAssert((ans), __FILE__, __LINE__); }
/**************************/
/* CUSOLVE ERROR CHECKING */
/**************************/
static const char *_cusolverGetErrorEnum(cusolverStatus_t error)
{
switch (error)
{
case CUSOLVER_STATUS_SUCCESS:
return "CUSOLVER_SUCCESS";
case CUSOLVER_STATUS_NOT_INITIALIZED:
return "CUSOLVER_STATUS_NOT_INITIALIZED";
case CUSOLVER_STATUS_ALLOC_FAILED:
return "CUSOLVER_STATUS_ALLOC_FAILED";
case CUSOLVER_STATUS_INVALID_VALUE:
return "CUSOLVER_STATUS_INVALID_VALUE";
case CUSOLVER_STATUS_ARCH_MISMATCH:
return "CUSOLVER_STATUS_ARCH_MISMATCH";
case CUSOLVER_STATUS_EXECUTION_FAILED:
return "CUSOLVER_STATUS_EXECUTION_FAILED";
case CUSOLVER_STATUS_INTERNAL_ERROR:
return "CUSOLVER_STATUS_INTERNAL_ERROR";
case CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED:
return "CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED";
}
return "<unknown>";
}
inline void __cusolveSafeCall(cusolverStatus_t err, const char *file, const int line)
{
if (CUSOLVER_STATUS_SUCCESS != err) {
fprintf(stderr, "CUSOLVE error in file '%s', line %d, error: %s \nterminating!\n", __FILE__, __LINE__, \
_cusolverGetErrorEnum(err)); \
assert(0); \
}
}
extern "C" void cusolveSafeCall(cusolverStatus_t err) { __cusolveSafeCall(err, __FILE__, __LINE__); }
/***************************/
/* CUSPARSE ERROR CHECKING */
/***************************/
static const char *_cusparseGetErrorEnum(cusparseStatus_t error)
{
switch (error)
{
case CUSPARSE_STATUS_SUCCESS:
return "CUSPARSE_STATUS_SUCCESS";
case CUSPARSE_STATUS_NOT_INITIALIZED:
return "CUSPARSE_STATUS_NOT_INITIALIZED";
case CUSPARSE_STATUS_ALLOC_FAILED:
return "CUSPARSE_STATUS_ALLOC_FAILED";
case CUSPARSE_STATUS_INVALID_VALUE:
return "CUSPARSE_STATUS_INVALID_VALUE";
case CUSPARSE_STATUS_ARCH_MISMATCH:
return "CUSPARSE_STATUS_ARCH_MISMATCH";
case CUSPARSE_STATUS_MAPPING_ERROR:
return "CUSPARSE_STATUS_MAPPING_ERROR";
case CUSPARSE_STATUS_EXECUTION_FAILED:
return "CUSPARSE_STATUS_EXECUTION_FAILED";
case CUSPARSE_STATUS_INTERNAL_ERROR:
return "CUSPARSE_STATUS_INTERNAL_ERROR";
case CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED:
return "CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED";
case CUSPARSE_STATUS_ZERO_PIVOT:
return "CUSPARSE_STATUS_ZERO_PIVOT";
}
return "<unknown>";
}
inline void __cusparseSafeCall(cusparseStatus_t err, const char *file, const int line)
{
if (CUSPARSE_STATUS_SUCCESS != err) {
fprintf(stderr, "CUSPARSE error in file '%s', line %Ndims\Nobjs %s\nerror %Ndims: %s\nterminating!\Nobjs", __FILE__, __LINE__, err, \
_cusparseGetErrorEnum(err)); \
cudaDeviceReset(); assert(0); \
}
}
extern "C" void cusparseSafeCall(cusparseStatus_t err) { __cusparseSafeCall(err, __FILE__, __LINE__); }
/********/
/* MAIN */
/********/
int main()
{
// --- Initialize cuSPARSE
cusparseHandle_t handle; cusparseSafeCall(cusparseCreate(&handle));
const int Nrows = 4; // --- Number of rows
const int Ncols = 4; // --- Number of columns
const int N = Nrows;
// --- Host side dense matrix
double *h_A_dense = (double*)malloc(Nrows*Ncols*sizeof(*h_A_dense));
// --- Column-major ordering
h_A_dense[0] = 1.0f; h_A_dense[4] = 4.0f; h_A_dense[8] = 0.0f; h_A_dense[12] = 0.0f;
h_A_dense[1] = 0.0f; h_A_dense[5] = 2.0f; h_A_dense[9] = 3.0f; h_A_dense[13] = 0.0f;
h_A_dense[2] = 5.0f; h_A_dense[6] = 0.0f; h_A_dense[10] = 0.0f; h_A_dense[14] = 7.0f;
h_A_dense[3] = 0.0f; h_A_dense[7] = 0.0f; h_A_dense[11] = 9.0f; h_A_dense[15] = 0.0f;
//create device array and copy host to it
double *d_A_dense; gpuErrchk(cudaMalloc(&d_A_dense, Nrows * Ncols * sizeof(*d_A_dense)));
gpuErrchk(cudaMemcpy(d_A_dense, h_A_dense, Nrows * Ncols * sizeof(*d_A_dense), cudaMemcpyHostToDevice));
// --- Descriptor for sparse matrix A
cusparseMatDescr_t descrA; cusparseSafeCall(cusparseCreateMatDescr(&descrA));
cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ZERO);
int nnz = 0; // --- Number of nonzero elements in dense matrix
const int lda = Nrows; // --- Leading dimension of dense matrix
// --- Device side number of nonzero elements per row
int *d_nnzPerVector; gpuErrchk(cudaMalloc(&d_nnzPerVector, Nrows * sizeof(*d_nnzPerVector)));
cusparseSafeCall(cusparseDnnz(handle, CUSPARSE_DIRECTION_ROW, Nrows, Ncols, descrA, d_A_dense, lda, d_nnzPerVector, &nnz));
// --- Host side number of nonzero elements per row
int *h_nnzPerVector = (int *)malloc(Nrows * sizeof(*h_nnzPerVector));
gpuErrchk(cudaMemcpy(h_nnzPerVector, d_nnzPerVector, Nrows * sizeof(*h_nnzPerVector), cudaMemcpyDeviceToHost));
printf("Number of nonzero elements in dense matrix = %i\n\n", nnz);
for (int i = 0; i < Nrows; ++i) printf("Number of nonzero elements in row %i = %i \n", i, h_nnzPerVector[i]);
printf("\n");
// --- Device side dense matrix
double *d_A; gpuErrchk(cudaMalloc(&d_A, nnz * sizeof(*d_A)));
int *d_A_RowIndices; gpuErrchk(cudaMalloc(&d_A_RowIndices, (Nrows + 1) * sizeof(*d_A_RowIndices)));
int *d_A_ColIndices; gpuErrchk(cudaMalloc(&d_A_ColIndices, nnz * sizeof(*d_A_ColIndices)));
cusparseSafeCall(cusparseDdense2csr(handle, Nrows, Ncols, descrA, d_A_dense, lda, d_nnzPerVector, d_A, d_A_RowIndices, d_A_ColIndices));
// --- Host side dense matrix
double *h_A = (double *)malloc(nnz * sizeof(*h_A));
int *h_A_RowIndices = (int *)malloc((Nrows + 1) * sizeof(*h_A_RowIndices));
int *h_A_ColIndices = (int *)malloc(nnz * sizeof(*h_A_ColIndices));
gpuErrchk(cudaMemcpy(h_A, d_A, nnz*sizeof(*h_A), cudaMemcpyDeviceToHost));
gpuErrchk(cudaMemcpy(h_A_RowIndices, d_A_RowIndices, (Nrows + 1) * sizeof(*h_A_RowIndices), cudaMemcpyDeviceToHost));
gpuErrchk(cudaMemcpy(h_A_ColIndices, d_A_ColIndices, nnz * sizeof(*h_A_ColIndices), cudaMemcpyDeviceToHost));
for (int i = 0; i < nnz; ++i) printf("A[%i] = %.0f ", i, h_A[i]); printf("\n");
for (int i = 0; i < (Nrows + 1); ++i) printf("h_A_RowIndices[%i] = %i \n", i, h_A_RowIndices[i]); printf("\n");
for (int i = 0; i < nnz; ++i) printf("h_A_ColIndices[%i] = %i \n", i, h_A_ColIndices[i]);
// --- Allocating and defining dense host and device data vectors
double *h_y = (double *)malloc(Nrows * sizeof(double));
h_y[0] = 100.0; h_y[1] = 200.0; h_y[2] = 400.0; h_y[3] = 500.0;
double *d_y; gpuErrchk(cudaMalloc(&d_y, Nrows * sizeof(double)));
gpuErrchk(cudaMemcpy(d_y, h_y, Nrows * sizeof(double), cudaMemcpyHostToDevice));
// --- Allocating the host and device side result vector
double *h_x = (double *)malloc(Ncols * sizeof(double));
double *d_x; gpuErrchk(cudaMalloc(&d_x, Ncols * sizeof(double)));
// --- CUDA solver initialization
cusolverSpHandle_t solver_handle;
cusolverSpCreate(&solver_handle);
// --- Using LU factorization
int singularity;
cusolveSafeCall(cusolverSpDcsrlsvluHost(solver_handle, N, nnz, descrA, h_A, h_A_RowIndices, h_A_ColIndices, h_y, 0.000001, 0, h_x, &singularity));
// --- Using QR factorization
//cusolveSafeCall(cusolverSpDcsrlsvqrHost(solver_handle, N, nnz, descrA, h_A, h_A_RowIndices, h_A_ColIndices, h_y, 0.000001, 0, h_x, &singularity));
//int rankA;
//int *p = (int *)malloc(N * sizeof(int));
//double min_norm;
//cusolveSafeCall(cusolverSpDcsrlsqvqrHost(solver_handle, N, N, nnz, descrA, h_A, h_A_RowIndices, h_A_ColIndices, h_y, 0.000001, &rankA, h_x, p, &min_norm));
printf("Showing the results...\n");
for (int i = 0; i < N; i++) printf("%f\n", h_x[i]);
}

How to configure cublas{t}symm() function arguments

This function performs the symmetric matrix-matrix multiplication using CUDA. Although, I succeeded in using the nonsymmetric version "cublas{t}gemm()" I couldn't use the "cublas{t}symm()" function properly.
I know that CUBLAS library uses column-major matrix storage. I am using row-major C/C++ matrix and I know how to solve this issue for "cublas{t}gemm()" by replacing the input matrices and etc. However, I couldn't solve it for the symmetric case. The problem is even if I use column-major matrix storage I find unexpectable results. Matrices contain complex floats (cuComplex). I assume I have row-major matrices. Here is the code and the output:
// Matrix multiplication: C = A * B.
// Host code.
//
// Utilities and system includes
#include <assert.h>
#include <helper_string.h> // helper for shared functions common to CUDA SDK samples
// CUDA runtime
#include <cuda_runtime.h>
#include <cublas_v2.h>
#ifndef min
#define min(a,b) ((a < b) ? a : b)
#endif
#ifndef max
#define max(a,b) ((a > b) ? a : b)
#endif
////////////////////////////////////////////////////////////////////////////////
// These are CUDA Helper functions (in addition to helper_cuda.h)
void inline checkError(cublasStatus_t status, const char *msg)
{
if (status != CUBLAS_STATUS_SUCCESS)
{
printf("%s", msg);
exit(EXIT_FAILURE);
}
}
// end of CUDA Helper Functions
// Allocates a matrix with random float entries.
void randomCmplxInit(cuComplex *data, int size)
{
for (int i = 0; i < size; ++i)
data[i] = make_cuComplex( rand() / (float)RAND_MAX, rand() / (float)RAND_MAX);
}
//void initializeCUDA(int argc, char **argv, int &devID, int &iSizeMultiple, sMatrixSize &matrix_size)
void initializeCUDA(int argc, char **argv, int &devID)
{
// By default, we use device 0, otherwise we override the device ID based on what is provided at the command line
cudaError_t error;
devID = 0;
int m,n,k;
if (checkCmdLineFlag(argc, (const char **)argv, "device"))
{
devID = getCmdLineArgumentInt(argc, (const char **)argv, "device");
error = cudaSetDevice(devID);
if (error != cudaSuccess)
{
printf("cudaSetDevice returned error code %d, line(%d)\n", error, __LINE__);
exit(EXIT_FAILURE);
}
}
// get number of SMs on this GPU
error = cudaGetDevice(&devID);
cudaDeviceProp deviceProp;
error = cudaGetDeviceProperties(&deviceProp, devID);
printf("GPU Device %d: \"%s\" with compute capability %d.%d\n\n", devID, deviceProp.name, deviceProp.major, deviceProp.minor);
// use a larger block size for Fermi and above
int block_size = (deviceProp.major < 2) ? 16 : 32;
}
////////////////////////////////////////////////////////////////////////////////
//! Run a simple test matrix multiply using CUBLAS
////////////////////////////////////////////////////////////////////////////////
int matrixMultiply(int argc, char **argv, int devID)
{
int i,j;
unsigned int m,n,k;
cudaDeviceProp deviceProp;
cudaError_t error;
error = cudaGetDeviceProperties(&deviceProp, devID);
if (error != cudaSuccess)
{
printf("cudaGetDeviceProperties returned error code %d, line(%d)\n", error, __LINE__);
exit(EXIT_FAILURE);
}
// use a larger block size for Fermi and above
int block_size = (deviceProp.major < 2) ? 16 : 32;
m=3; //number of rows of matrix op(A) and C. A--> (m x k)
n=2; //number of columns of matrix op(B) and C. B--> (k x n)
k=m; //number of columns of op(A) and rows of op(B). C--> (m x n)
// I want to compute C = A*B in row-major format,
//so I must find C(T)=B(T)A(T) = C(T)A in column-major format
// allocate host memory for matrices A and B
unsigned int size_A = m*(m+1)/2; //size of a symmetric matrix
unsigned int mem_size_A = sizeof(cuComplex) * size_A;
cuComplex *h_A = (cuComplex *)malloc(mem_size_A);
unsigned int size_B = m*n;
unsigned int mem_size_B = sizeof(cuComplex) * size_B;
cuComplex *h_B = (cuComplex *)malloc(mem_size_B);
// initialize host memory
for (i = 0; i < size_A; ++i)
h_A[i] = make_cuComplex( (float)(i+1),(float)0);
for (i = 0; i < size_B; ++i)
h_B[i] = make_cuComplex((float)(i+2), (float)0);
// allocate device memory
cuComplex *d_A, *d_B, *d_C;
unsigned int size_C = m*n;
unsigned int mem_size_C = sizeof(cuComplex) * size_C;
// allocate host memory for the result
cuComplex *h_C = (cuComplex *) malloc(mem_size_C);
cuComplex *h_CUBLAS = (cuComplex *) malloc(mem_size_C);
error = cudaMalloc((void **) &d_A, mem_size_A);
error = cudaMalloc((void **) &d_B, mem_size_B);
// copy host memory to device
error = cudaMemcpy(d_A, h_A, mem_size_A, cudaMemcpyHostToDevice);
error = cudaMemcpy(d_B, h_B, mem_size_B, cudaMemcpyHostToDevice);
error = cudaMalloc((void **) &d_C, mem_size_C);
// setup execution parameters
dim3 threads(block_size, block_size);
dim3 grid(n / threads.x, m / threads.y);
// create and start timer
printf("Computing result using CUBLAS...");
// CUBLAS version 2.0
{
cublasHandle_t handle;
cublasStatus_t ret;
ret = cublasCreate(&handle);
if (ret != CUBLAS_STATUS_SUCCESS)
{
printf("cublasCreate returned error code %d, line(%d)\n", ret, __LINE__);
exit(EXIT_FAILURE);
}
const cuComplex alpha = make_cuComplex(1.0f,0.0f);
const cuComplex beta = make_cuComplex(0.0f,0.0f);
//Perform operation with cublas
ret = cublasCsymm(handle, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_UPPER, n,m,&alpha,d_A,m,d_B,m,&beta,d_C,m);
// copy result from device to host
error = cudaMemcpy(h_CUBLAS, d_C, mem_size_C, cudaMemcpyDeviceToHost);
checkError(cublasDestroy(handle), "cublasDestroy() error!\n");
}
printf ("\nComputations completed.\n\n");
printf (" symm matrix A: \n");
int s=0;
for (i=0; i<min(m,4); i++) {
for (j=0; j<=i; j++) {
//printf ("%7.5G + j(%7.5G)", h_A[j+i*k].x,h_A[j+i*k].y);
printf ("%7.5G", h_A[s].x);
s++;
}
printf ("\n");
}
printf ("\n matrix B: \n");
for (i=0; i<min(k,4); i++) {
for (j=0; j<min(n,4); j++) {
//printf ("%7.5G + j(%7.5G)", h_B[j+i*n].x,h_B[j+i*n].y);
printf ("%7.5G", h_B[j+i*n].x);
}
printf ("\n");
}
printf ("\n matrix C=A*B: \n");
for (i=0; i<min(m,4); i++) {
for (j=0; j<min(n,4); j++) {
//printf ("%7.5G + j(%7.5G)", h_CUBLAS[j+i*n].x,h_CUBLAS[j+i*n].y);
printf ("%7.5G", h_CUBLAS[j+i*n].x);
}
printf ("\n");
}
// clean up memory
free(h_A);
free(h_B);
free(h_C);
//free(reference);
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
cudaDeviceReset();
}
////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
printf("[Matrix Multiply CUBLAS] - Starting...\n");
int devID = 0, sizeMult = 5;
initializeCUDA(argc, argv, devID);
int matrix_result = matrixMultiply(argc, argv, devID);
}
I suppose that I have the following matrices for the multiplication:
A =
1 2 4
2 3 5
4 5 6
B =
2 3
4 5
6 7
and expect to obtain
A*B =
34 41
46 56
64 79
But the obtained OUTPUT is as follows:
symm matrix A:
1
2 3
4 5 6
matrix B:
2 3
4 5
6 7
matrix C=A*B:
78 90
74 97
114 146
What am I missing in this code ? Probably the arguments of "cublasCsymm" function are wrong.
Thanks,
Kagan
EDIT:
Based on questions posed below, I elected to re-work my answer and example code.
You can handle row-major storage without transpose at least for these operations. And this observation is further facilitated by the fact that the symm function does not used the packed storage.
So to answer the additional questions:
the cublasCsymm function does not use a packed storage format (like some other functions such as cublasCspmv for example), because the cublasCsymm function is intended to duplicate the functionality of the corresponding netlib function, which also does not use a packed storage format. Based on my review of the cublas API, I don't see a symmetric-packed-storage matrix-matrix multiply function available.
You can use row-major storage (e.g. C-style) with cublas, without transposing, at least for these operations (matrix-matrix multiply, without packed storage) by following the advice given here.
What follows is a re-worked version of my previous example, that incorporates the information in item 2 above.
// Matrix multiplication: C = A * B.
// Host code.
//
// Utilities and system includes
#include <assert.h>
#include <helper_string.h> // helper for shared functions common to CUDA SDK sa
mples
// CUDA runtime
#include <cuda_runtime.h>
#include <cublas_v2.h>
// error check macros
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
// for CUBLAS V2 API
#define cublasCheckErrors(fn) \
do { \
cublasStatus_t __err = fn; \
if (__err != CUBLAS_STATUS_SUCCESS) { \
fprintf(stderr, "Fatal cublas error: %d (at %s:%d)\n", \
(int)(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
#ifndef min
#define min(a,b) ((a < b) ? a : b)
#endif
#ifndef max
#define max(a,b) ((a > b) ? a : b)
#endif
////////////////////////////////////////////////////////////////////////////////
// These are CUDA Helper functions (in addition to helper_cuda.h)
void inline checkError(cublasStatus_t status, const char *msg)
{
if (status != CUBLAS_STATUS_SUCCESS)
{
printf("%s", msg);
exit(EXIT_FAILURE);
}
}
// end of CUDA Helper Functions
// Allocates a matrix with random float entries.
void randomCmplxInit(cuComplex *data, int size)
{
for (int i = 0; i < size; ++i)
data[i] = make_cuComplex( rand() / (float)RAND_MAX, rand() / (float)RAND
_MAX);
}
//void initializeCUDA(int argc, char **argv, int &devID, int &iSizeMultiple, sMa
trixSize &matrix_size)
void initializeCUDA(int argc, char **argv, int &devID)
{
// By default, we use device 0, otherwise we override the device ID based on
what is provided at the command line
cudaError_t error;
devID = 0;
if (checkCmdLineFlag(argc, (const char **)argv, "device"))
{
devID = getCmdLineArgumentInt(argc, (const char **)argv, "device");
error = cudaSetDevice(devID);
if (error != cudaSuccess)
{
printf("cudaSetDevice returned error code %d, line(%d)\n", error, __
LINE__);
exit(EXIT_FAILURE);
}
}
// get number of SMs on this GPU
error = cudaGetDevice(&devID);
cudaDeviceProp deviceProp;
error = cudaGetDeviceProperties(&deviceProp, devID);
printf("GPU Device %d: \"%s\" with compute capability %d.%d\n\n", devID, dev
iceProp.name, deviceProp.major, deviceProp.minor);
}
////////////////////////////////////////////////////////////////////////////////
//! Run a simple test matrix multiply using CUBLAS
////////////////////////////////////////////////////////////////////////////////
int matrixMultiply(int argc, char **argv, int devID)
{
int i,j;
unsigned int m,n,k;
cudaDeviceProp deviceProp;
cudaError_t error;
error = cudaGetDeviceProperties(&deviceProp, devID);
if (error != cudaSuccess)
{
printf("cudaGetDeviceProperties returned error code %d, line(%d)\n", error, __LINE__);
exit(EXIT_FAILURE);
}
// use a larger block size for Fermi and above
m=3; //number of rows of matrix op(A) and C. A--> (m x k)
n=2; //number of columns of matrix op(B) and C. B--> (k x n)
k=m; //number of columns of op(A) and rows of op(B). C--> (m x n)
// I want to compute C = A*B in row-major format,
//so I must find C(T)=B(T)A(T) = C(T)A in column-major format
// allocate host memory for matrices A and B
unsigned int size_A = m*m; //size of a symmetric matrix
printf("size_A = %d\n", size_A);
unsigned int mem_size_A = sizeof(cuComplex) * size_A;
cuComplex *h_A = (cuComplex *)malloc(mem_size_A);
unsigned int size_B = m*n;
unsigned int mem_size_B = sizeof(cuComplex) * size_B;
cuComplex *h_B = (cuComplex *)malloc(mem_size_B);
// initialize host memory
// for (i = 0; i < size_A; ++i)
// h_A[i] = make_cuComplex( (float)(i+1),(float)0);
h_A[0] = make_cuComplex((float)1, (float)0);
h_A[1] = make_cuComplex((float)2, (float)0);
h_A[2] = make_cuComplex((float)4, (float)0);
h_A[3] = make_cuComplex((float)0, (float)0);
h_A[4] = make_cuComplex((float)3, (float)0);
h_A[5] = make_cuComplex((float)5, (float)0);
h_A[6] = make_cuComplex((float)0, (float)0);
h_A[7] = make_cuComplex((float)0, (float)0);
h_A[8] = make_cuComplex((float)6, (float)0);
// for (i = 0; i < size_B; ++i)
// h_B[i] = make_cuComplex((float)(i+2), (float)0);
h_B[0] = make_cuComplex((float)2, (float)0);
h_B[1] = make_cuComplex((float)3, (float)0);
h_B[2] = make_cuComplex((float)4, (float)0);
h_B[3] = make_cuComplex((float)5, (float)0);
h_B[4] = make_cuComplex((float)6, (float)0);
h_B[5] = make_cuComplex((float)7, (float)0);
// allocate device memory
cuComplex *d_A, *d_B, *d_C;
unsigned int size_C = m*n;
unsigned int mem_size_C = sizeof(cuComplex) * size_C;
// allocate host memory for the result
cuComplex *h_C = (cuComplex *) malloc(mem_size_C);
cuComplex *h_CUBLAS = (cuComplex *) malloc(mem_size_C);
error = cudaMalloc((void **) &d_A, mem_size_A);
error = cudaMalloc((void **) &d_B, mem_size_B);
// copy host memory to device
error = cudaMemcpy(d_A, h_A, mem_size_A, cudaMemcpyHostToDevice);
error = cudaMemcpy(d_B, h_B, mem_size_B, cudaMemcpyHostToDevice);
error = cudaMalloc((void **) &d_C, mem_size_C);
// create and start timer
printf("Computing result using CUBLAS...");
// CUBLAS version 2.0
{
cublasHandle_t handle;
cublasStatus_t ret;
ret = cublasCreate(&handle);
if (ret != CUBLAS_STATUS_SUCCESS)
{
printf("cublasCreate returned error code %d, line(%d)\n", ret, __LINE__);
exit(EXIT_FAILURE);
}
const cuComplex alpha = make_cuComplex(1.0f,0.0f);
const cuComplex beta = make_cuComplex(0.0f,0.0f);
//Perform operation with cublas
ret = cublasCsymm(handle, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER, n,m,&alpha,d_A,m,d_B,n,&beta,d_C,n);
if (ret != CUBLAS_STATUS_SUCCESS)
{
printf("cublasCsymm returned error code %d, line(%d)\n", ret, __LINE__);
exit(EXIT_FAILURE);
}
// copy result from device to host
error = cudaMemcpy(h_CUBLAS, d_C, mem_size_C, cudaMemcpyDeviceToHost);
checkError(cublasDestroy(handle), "cublasDestroy() error!\n");
}
printf ("\nComputations completed.\n\n");
printf (" symm matrix A: \n");
// int s=0;
for (i=0; i<min(m,4); i++) {
for (j=0; j<min(m,4); j++) {
//printf ("%7.5G + j(%7.5G)", h_A[j+i*k].x,h_A[j+i*k].y);
// printf ("%7.5G", h_A[s].x);
printf ("%7.5G", h_A[j+(i*m)].x);
// s++;
}
printf ("\n");
}
printf ("\n matrix B: \n");
for (i=0; i<min(k,4); i++) {
for (j=0; j<min(n,4); j++) {
//printf ("%7.5G + j(%7.5G)", h_B[j+i*n].x,h_B[j+i*n].y);
printf ("%7.5G", h_B[j+(i*n)].x);
}
printf ("\n");
}
printf ("\n matrix C=A*B: \n");
for (i=0; i<min(m,4); i++) {
for (j=0; j<min(n,4); j++) {
//printf ("%7.5G + j(%7.5G)", h_CUBLAS[j+i*n].x,h_CUBLAS[j+i*n].y);
printf ("%7.5G", h_CUBLAS[j+(i*n)].x);
}
printf ("\n");
}
// clean up memory
free(h_A);
free(h_B);
free(h_C);
//free(reference);
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
cudaDeviceReset();
return 0;
}
////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
printf("[Matrix Multiply CUBLAS] - Starting...\n");
int devID = 0;
initializeCUDA(argc, argv, devID);
int matrix_result = matrixMultiply(argc, argv, devID);
cudaCheckErrors("some error");
return 0;
}
$ ./t213
[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "Tesla M2070" with compute capability 2.0
size_A = 9
Computing result using CUBLAS...
Computations completed.
symm matrix A:
1 2 4
0 3 5
0 0 6
matrix B:
2 3
4 5
6 7
matrix C=A*B:
34 41
46 56
64 79
$
ORIGINAL RESPONSE:
Several problems:
When I run your code as you have it posted right now, I don't get the
results that you show. Here's what I get:
[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "Tesla M2070" with compute capability 2.0
Computing result using CUBLAS...
Computations completed.
symm matrix A:
1
2 3
4 5 6
matrix B:
2 3
4 5
6 7
matrix C=A*B:
-131 -128
260 -122
-115 266
The code compiles with a number of warnings and also you're not doing proper error checking (for example you're not checking the return value from cublasCsymm
You are wanting to multiply C = A*B This means A is on the LEFT,
but you are passing CUBLAS_SIDE_RIGHT to cublasCsymm Several other cublasCsymm parameters were wrong as well. I think maybe you thought you could do A*B as (B(T)*A(T)) but that only works for square matrices. Not sure what you were thinking, exactly.
You having row-major storage on your matrices and passing them to cublas which interprets them in column-major order. For the following matrix:
1 2
3 4
row-major storage looks like this:
1 2 3 4
column-major storage looks like this:
1 3 2 4
You can transpose these matrices if you wish, using cublasCgeam or you can manually modify your storage.
You're making some sort of assumption about some kind of compressed
storage format for the symmetric matrix A which is not correct.
Read carefully the defintion of the storage
type.
It doesn't say the portion of the matrix that is "supplied" or
"present" it says the portion of the matrix that is filled.
Here is a complete code that has the above problems fixed:
// Matrix multiplication: C = A * B.
// Host code.
//
// Utilities and system includes
#include <assert.h>
#include <helper_string.h> // helper for shared functions common to CUDA SDK sa
mples
// CUDA runtime
#include <cuda_runtime.h>
#include <cublas_v2.h>
// error check macros
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
// for CUBLAS V2 API
#define cublasCheckErrors(fn) \
do { \
cublasStatus_t __err = fn; \
if (__err != CUBLAS_STATUS_SUCCESS) { \
fprintf(stderr, "Fatal cublas error: %d (at %s:%d)\n", \
(int)(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
#ifndef min
#define min(a,b) ((a < b) ? a : b)
#endif
#ifndef max
#define max(a,b) ((a > b) ? a : b)
#endif
////////////////////////////////////////////////////////////////////////////////
// These are CUDA Helper functions (in addition to helper_cuda.h)
void inline checkError(cublasStatus_t status, const char *msg)
{
if (status != CUBLAS_STATUS_SUCCESS)
{
printf("%s", msg);
exit(EXIT_FAILURE);
}
}
// end of CUDA Helper Functions
// Allocates a matrix with random float entries.
void randomCmplxInit(cuComplex *data, int size)
{
for (int i = 0; i < size; ++i)
data[i] = make_cuComplex( rand() / (float)RAND_MAX, rand() / (float)RAND_MAX);
}
//void initializeCUDA(int argc, char **argv, int &devID, int &iSizeMultiple, sMatrixSize &matrix_size)
void initializeCUDA(int argc, char **argv, int &devID)
{
// By default, we use device 0, otherwise we override the device ID based on what is provided at the command line
cudaError_t error;
devID = 0;
if (checkCmdLineFlag(argc, (const char **)argv, "device"))
{
devID = getCmdLineArgumentInt(argc, (const char **)argv, "device");
error = cudaSetDevice(devID);
if (error != cudaSuccess)
{
printf("cudaSetDevice returned error code %d, line(%d)\n", error, __LINE__);
exit(EXIT_FAILURE);
}
}
// get number of SMs on this GPU
error = cudaGetDevice(&devID);
cudaDeviceProp deviceProp;
error = cudaGetDeviceProperties(&deviceProp, devID);
printf("GPU Device %d: \"%s\" with compute capability %d.%d\n\n", devID, deviceProp.name, deviceProp.major, deviceProp.minor);
}
////////////////////////////////////////////////////////////////////////////////
//! Run a simple test matrix multiply using CUBLAS
////////////////////////////////////////////////////////////////////////////////
int matrixMultiply(int argc, char **argv, int devID)
{
int i,j;
unsigned int m,n,k;
cudaDeviceProp deviceProp;
cudaError_t error;
error = cudaGetDeviceProperties(&deviceProp, devID);
if (error != cudaSuccess)
{
printf("cudaGetDeviceProperties returned error code %d, line(%d)\n", error, __LINE__);
exit(EXIT_FAILURE);
}
// use a larger block size for Fermi and above
m=3; //number of rows of matrix op(A) and C. A--> (m x k)
n=2; //number of columns of matrix op(B) and C. B--> (k x n)
k=m; //number of columns of op(A) and rows of op(B). C--> (m x n)
// I want to compute C = A*B in row-major format,
//so I must find C(T)=B(T)A(T) = C(T)A in column-major format
// allocate host memory for matrices A and B
unsigned int size_A = m*m; //size of a symmetric matrix
printf("size_A = %d\n", size_A);
unsigned int mem_size_A = sizeof(cuComplex) * size_A;
cuComplex *h_A = (cuComplex *)malloc(mem_size_A);
unsigned int size_B = m*n;
unsigned int mem_size_B = sizeof(cuComplex) * size_B;
cuComplex *h_B = (cuComplex *)malloc(mem_size_B);
// initialize host memory
// for (i = 0; i < size_A; ++i)
// h_A[i] = make_cuComplex( (float)(i+1),(float)0);
h_A[0] = make_cuComplex((float)1, (float)0);
h_A[1] = make_cuComplex((float)2, (float)0);
h_A[2] = make_cuComplex((float)4, (float)0);
h_A[3] = make_cuComplex((float)0, (float)0);
h_A[4] = make_cuComplex((float)3, (float)0);
h_A[5] = make_cuComplex((float)5, (float)0);
h_A[6] = make_cuComplex((float)0, (float)0);
h_A[7] = make_cuComplex((float)0, (float)0);
h_A[8] = make_cuComplex((float)6, (float)0);
// for (i = 0; i < size_B; ++i)
// h_B[i] = make_cuComplex((float)(i+2), (float)0);
h_B[0] = make_cuComplex((float)2, (float)0);
h_B[1] = make_cuComplex((float)4, (float)0);
h_B[2] = make_cuComplex((float)6, (float)0);
h_B[3] = make_cuComplex((float)3, (float)0);
h_B[4] = make_cuComplex((float)5, (float)0);
h_B[5] = make_cuComplex((float)7, (float)0);
// allocate device memory
cuComplex *d_A, *d_B, *d_C;
unsigned int size_C = m*n;
unsigned int mem_size_C = sizeof(cuComplex) * size_C;
// allocate host memory for the result
cuComplex *h_C = (cuComplex *) malloc(mem_size_C);
cuComplex *h_CUBLAS = (cuComplex *) malloc(mem_size_C);
error = cudaMalloc((void **) &d_A, mem_size_A);
error = cudaMalloc((void **) &d_B, mem_size_B);
// copy host memory to device
error = cudaMemcpy(d_A, h_A, mem_size_A, cudaMemcpyHostToDevice);
error = cudaMemcpy(d_B, h_B, mem_size_B, cudaMemcpyHostToDevice);
error = cudaMalloc((void **) &d_C, mem_size_C);
// create and start timer
printf("Computing result using CUBLAS...");
// CUBLAS version 2.0
{
cublasHandle_t handle;
cublasStatus_t ret;
ret = cublasCreate(&handle);
if (ret != CUBLAS_STATUS_SUCCESS)
{
printf("cublasCreate returned error code %d, line(%d)\n", ret, __LINE__);
exit(EXIT_FAILURE);
}
const cuComplex alpha = make_cuComplex(1.0f,0.0f);
const cuComplex beta = make_cuComplex(0.0f,0.0f);
//Perform operation with cublas
ret = cublasCsymm(handle, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER, m,n,&alpha,d_A,m,d_B,m,&beta,d_C,m);
if (ret != CUBLAS_STATUS_SUCCESS)
{
printf("cublasCsymm returned error code %d, line(%d)\n", ret, __LINE__);
exit(EXIT_FAILURE);
}
Here is the output:
[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "Tesla M2070" with compute capability 2.0
size_A = 9
Computing result using CUBLAS...
Computations completed.
symm matrix A:
1 0 0
2 3 0
4 5 6
matrix B:
2 3
4 5
6 7
matrix C=A*B:
34 41
46 56
64 79

vector addition code failed when executed with cuda driverAPI functions

I'm running on GPU device with driver version 5.0 and using cuda 5.0. When I executed vecAdd completely with driver API functions, I'm facing errors. The code I am working on is posted below. Can anyone help me solve the errors?
#include <stdio.h>
#include <stdlib.h>
#include "timer.h"
#include "vecAdd-kernel.ptx.h"
#include</opt/apps/cuda/5.0/include/cuda.h>
#include</opt/apps/cuda/5.0/include/cuda_runtime_api.h>
void compute_vec_add(int N, float *a, float* b, float *c);
int main(int argc,char** argv) {
static CUcontext ctx;
CUdevice dev;
CUdeviceptr d_a, d_b, d_c;
float *h_a, *h_b, *h_c, *h_temp;
int i;
int N = 1024 * 1024*8;
struct stopwatch_t* timer = NULL;
long double t_pcie_htd, t_pcie_dth, t_kernel, t_cpu;
/* Setup timers */
stopwatch_init ();
timer = stopwatch_create ();
/*
Create the vectors
*/
h_a = (float *) malloc(sizeof(float) * N);
h_b = (float *) malloc(sizeof(float) * N);
h_c = (float *) malloc(sizeof(float) * N);
/*
Set the initial values of h_a, h_b, and h_c
*/
for (i=0; i < N; i++) {
h_a[i] = (float) (rand() % 100) / 10.0;
h_b[i] = (float) (rand() % 100) / 10.0;
h_c[i] = (float) 0.0;
}
stopwatch_start (timer);
CUmodule mod;
CUfunction vecAddFunc;
cuInit(0);
cuDeviceGet(&dev, 0);
cuCtxCreate(&ctx, 0, dev);
cuModuleLoadData(&mod, (char *) imageBytes);
cuModuleGetFunction(&vecAddFunc, mod, "vecAdd");
size_t offset = 0;
cuMemAlloc(&d_a,sizeof(float) * N);
cuMemAlloc(&d_b,sizeof(float) * N);
cuMemAlloc(&d_c,sizeof(float) * N);
cuMemcpyHtoD(d_a,h_a,sizeof(float) * N);
cuMemcpyHtoD(d_b,h_b,sizeof(float) * N);
cuMemcpyHtoD(d_c,h_c,sizeof(float) * N);
cuParamSetv(vecAddFunc, offset, &d_a, sizeof(d_a));
offset += sizeof(d_a);
cuParamSetv(vecAddFunc, offset, &d_b, sizeof(d_b));
offset += sizeof(d_b);
cuParamSetv(vecAddFunc, offset, &d_c, sizeof(d_c));
offset += sizeof(d_c);
cuParamSetSize(vecAddFunc, offset);
cuFuncSetBlockShape (vecAddFunc, 256, 1, 1);
cuLaunchGrid(vecAddFunc, N/256, 1);
cuStreamSynchronize(0);
cuMemcpyDtoH(h_c,d_c,sizeof(float) * N);
t_kernel = stopwatch_stop (timer);
fprintf (stderr, "Time to execute GPU kernel: %Lg secs\n",t_kernel);
/*
Double check errors
*/
h_temp = (float *) malloc(sizeof(float) * N);
stopwatch_start (timer);
compute_vec_add (N, h_a, h_b, h_temp);
t_cpu = stopwatch_stop (timer);
fprintf (stderr, "Time to execute CPU program: %Lg secs\n",
t_cpu);
int cnt = 0;
for(int i = 0; i < N; i++) {
if(abs(h_temp[i] - h_c[i]) > 1e-5) cnt++;
}
fprintf(stderr, "number of errors: %d out of %d\n", cnt, N);
/*
Free the host memory
*/
free(h_a);
free(h_b);
free(h_c);
cuMemFree(d_a);
cuMemFree(d_b);
cuMemFree(d_c);
cuCtxDestroy(ctx);
/*
Free timer
*/
stopwatch_destroy (timer);
if(cnt == 0) {
printf("\n\nSuccess\n");
}
}
void
compute_vec_add(int N, float *a, float* b, float *c) {
int i;
for (i=0;i<N;i++)
c[i]=a[i]+b[i];
}
and when I used cuda-gdb for finding the errors the output is:
Timer: gettimeofday
Timer resolution: ~ 1 us (?)
[New Thread 0x40200940 (LWP 32058)]
[Context Create of context 0x645a80 on Device 0]
warning: Cuda API error detected: cuModuleLoadData returned (0xd1)
warning: Cuda API error detected: cuModuleGetFunction returned (0x190)
warning: Cuda API error detected: cuParamSetv returned (0x190)
warning: Cuda API error detected: cuParamSetv returned (0x190)
warning: Cuda API error detected: cuParamSetv returned (0x190)
warning: Cuda API error detected: cuParamSetSize returned (0x190)
warning: Cuda API error detected: cuFuncSetBlockShape returned (0x190)
Program received signal SIGSEGV, Segmentation fault.
0x00002aaaaadab756 in ?? () from /usr/lib64/libcuda.so.1
Can anyone suggest why I'm facing the error and help me solve it?
The kernel code I'm using is
/*
* **CUDA KERNEL**
*
* Compute the sum of two vectors
* C[i] = A[i] + B[i]
*
*/
extern "C" {
__global__ void vecAdd(float* a, float* b, float* c)
{
/* Calculate index for this thread */
int i = blockIdx.x * blockDim.x + threadIdx.x;
/* Compute the element of C */
c[i] = a[i] + b[i];
}
}
and the compilation commands I'm using are
LDFLAGS = -I/usr/local/cuda/include \
-L/usr/local/cuda/lib64 \
vecAdd-dummy: vecAdd-dummy.cu timer.o vecAdd-kernel.ptx.h
nvcc -o vecAdd-dummy -arch=sm_20 vecAdd-dummy.cu timer.c ${LDFLAGS} -lcuda -g -G
vecAdd-kernel.ptx : vecAdd-kernel.cu
nvcc -arch=sm_20 -ptx $^ -o $#
vecAdd-kernel.ptx.h : vecAdd-kernel.ptx
bin2c -t "char" $^ > $#
the GPU device I'm using is Tesla M2090
Cuda API error detected: cuModuleLoadData returned (0xd1)
/**
* This indicates that there is no kernel image available that is suitable
* for the device. This can occur when a user specifies code generation
* options for a particular CUDA source file that do not include the
* corresponding device configuration.
*/
CUDA_ERROR_NO_BINARY_FOR_GPU = 209,
The rest of the errors are a result of the failure to load the module.
The functions cuParamSet* and cuFuncSetBlockShaper were deprecated in CUDA 4.0. After fixing the module load problem I recommend you look at the function cuLaunchKernel.