Large iteration loops in CUDA kernels causes memory error [duplicate] - cuda

I have a kernel to calculate different elements of a matrix, based on their position (diagonal or off-diagonal). The kernel works as expected when calculating matrices of sizes:
14 x 14 (I understand this is small and does not make proper use of the GPU resources but this was purely for testing purposes to ensure results were correct)
118 x 118, and
300 x 300
However, when I am trying to calculate a matrix of size 2383 x 2383, the kernel crashes. Specifically, the error "Unspecified launch failure" is thrown on the cudaMemcpy() line to return results from device to host. From research, I understand that this error usually arises in the case of an out of bounds memory access (e.g. in an array), however, what I don't get is that it works for the three previous cases but not for the 2383 x 2383 case. The kernel code is shown below:
__global__ void createYBus(float *R, float *X, float *B, int numberOfBuses, int numberOfBranches, int *fromBus, int *toBus, cuComplex *y)
{
int rowIdx = blockIdx.y*blockDim.y + threadIdx.y;
int colIdx = blockIdx.x*blockDim.x + threadIdx.x;
int index = rowIdx*numberOfBuses + colIdx;
if (rowIdx<numberOfBuses && colIdx<numberOfBuses)
{
for (int i=0; i<numberOfBranches; ++i)
{
if (rowIdx==fromBus[i] && colIdx==fromBus[i]) { //diagonal element
y[index] = cuCaddf(y[index], make_cuComplex((R[i]/((R[i]*R[i])+(X[i]*X[i]))), (-(X[i]/((R[i]*R[i])+(X[i]*X[i])))+ (B[i]/2))));
}
if (rowIdx==toBus[i] && colIdx==toBus[i]) { //diagonal element
y[index] = cuCaddf(y[index], make_cuComplex((R[i]/((R[i]*R[i])+(X[i]*X[i]))), (-(X[i]/((R[i]*R[i])+(X[i]*X[i])))+ (B[i]/2))));
}
if (rowIdx==fromBus[i] && colIdx==toBus[i]) { //off-diagonal element
y[index] = make_cuComplex(-(R[i]/((R[i]*R[i])+(X[i]*X[i]))), X[i]/((R[i]*R[i])+(X[i]*X[i])));
}
if (rowIdx==toBus[i] && colIdx==fromBus[i]) { //off-diagonal element
y[index] = make_cuComplex(-(R[i]/((R[i]*R[i])+(X[i]*X[i]))), X[i]/((R[i]*R[i])+(X[i]*X[i])));
}
}
}
}
Global memory allocations are done via calls to cudaMalloc(). The allocations made in the code are as follows:
cudaStat1 = cudaMalloc((void**)&dev_fromBus, numLines*sizeof(int));
cudaStat2 = cudaMalloc((void**)&dev_toBus, numLines*sizeof(int));
cudaStat3 = cudaMalloc((void**)&dev_R, numLines*sizeof(float));
cudaStat4 = cudaMalloc((void**)&dev_X, numLines*sizeof(float));
cudaStat5 = cudaMalloc((void**)&dev_B, numLines*sizeof(float));
cudaStat6 = cudaMalloc((void**)&dev_y, numberOfBuses*numberOfBuses*sizeof(cuComplex));
cudaStat7 = cudaMalloc((void**)&dev_Pd, numberOfBuses*sizeof(float));
cudaStat8 = cudaMalloc((void**)&dev_Qd, numberOfBuses*sizeof(float));
cudaStat9 = cudaMalloc((void**)&dev_Vmag, numberOfBuses*sizeof(float));
cudaStat10 = cudaMalloc((void**)&dev_theta, numberOfBuses*sizeof(float));
cudaStat11 = cudaMalloc((void**)&dev_Peq, numberOfBuses*sizeof(float));
cudaStat12 = cudaMalloc((void**)&dev_Qeq, numberOfBuses*sizeof(float));
cudaStat13 = cudaMalloc((void**)&dev_Peq1, numberOfBuses*sizeof(float));
cudaStat14 = cudaMalloc((void**)&dev_Qeq1, numberOfBuses*sizeof(float));
...
...
cudaStat15 = cudaMalloc((void**)&dev_powerMismatch, jacSize*sizeof(float));
cudaStat16 = cudaMalloc((void**)&dev_jacobian, jacSize*jacSize*sizeof(float));
cudaStat17 = cudaMalloc((void**)&dev_stateVector, jacSize*sizeof(float));
cudaStat18 = cudaMalloc((void**)&dev_PQindex, jacSize*sizeof(int));
where cudaStatN are of type cudaError_t to catch errors. The last four allocations were done later on in the code and are for another kernel. However these allocations were done before the kernel in question was called.
The launch parameters are as follows:
dim3 dimBlock(16, 16); //number of threads
dim3 dimGrid((numberOfBuses+15)/16, (numberOfBuses+15)/16); //number of blocks
//launch kernel once data has been copied to GPU
createYBus<<<dimGrid, dimBlock>>>(dev_R, dev_X, dev_B, numberOfBuses, numLines, dev_fromBus, dev_toBus, dev_y);
//copy results back to CPU
cudaStat6 = cudaMemcpy(y_bus, dev_y, numberOfBuses*numberOfBuses*sizeof(cuComplex), cudaMemcpyDeviceToHost);
if (cudaStat6 != cudaSuccess) {
cout<<"Device memcpy failed"<<endl;
cout<<cudaGetErrorString(cudaStat6)<<endl;
return 1;
}
I removed the timing code just to show the block and grid dimensions and error checking technique used.
I also have a host (C++ code) version of this function and I'm passing the data to both functions and then comparing results, firstly, to ensure the kernel produces correct results, and secondly in terms of execution time to compare performance. I have double checked the data for the 2383 x 2383 case (it's being read in from a text file and copied to global memory) and I'm not finding any anomalies in array accesses/indexing.
I'm using Visual Studio 2010, so I tried using Nsight to find the error (I'm not too well-versed with Nsight). The summary report overview states: "There was 1 runtime API call error reported. (Please see the CUDA Runtime API Calls report for further information). In the list of runtime API calls, cudaMemcpy returns error 4 - not sure if the Thread ID (5012) is of any significance in the table - this number varies with every run. CUDA memcheck tool (in the command line) returns the following:
Thank you for using this program
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaMemcpy.
========= Saved host backtrace up to driver entry point at error
=========
========= ERROR SUMMARY: 1 error
I know my kernel isn't the most efficient as there are many global memory accesses. Why is the kernel crashing for this larger matrix? Is there an out of bounds array access that I'm missing? Any assistance would be greatly appreciated.

Solved the problem. Turns out the WDDM TDR (timeout detecion recovery) was enabled and the delay was set to 2 seconds. This means that if the kernel execution time exceeds 2s, the driver will crash and recover. This is applicable to graphics and rendering (for general purpose uses of the GPU). In this case however, the TDR must either me disabled or the delay increased. By increasing the delay to 10s, the crash error "unspecified launch failure" ceased to appear and kernel execution continued as before.
The TDR delay (as well as enabling/disabling) can be done through Nsight options in the Nsight Monitor or through the Registry (HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers) - DWORDS Tdrdelay and Tdrlevel.

I tried to reproduce your code with the following complete example. The code compiles, runs with no error.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include "cuComplex.h"
__global__ void createYBus(float *R, float *X, float *B, int numberOfBuses, int numberOfBranches, int *fromBus, int *toBus, cuComplex *y)
{
int rowIdx = blockIdx.y*blockDim.y + threadIdx.y;
int colIdx = blockIdx.x*blockDim.x + threadIdx.x;
int index = rowIdx*numberOfBuses + colIdx;
if (rowIdx<numberOfBuses && colIdx<numberOfBuses)
{
for (int i=0; i<numberOfBranches; ++i)
{
if (rowIdx==fromBus[i] && colIdx==fromBus[i]) { //diagonal element
y[index] = cuCaddf(y[index], make_cuComplex((R[i]/((R[i]*R[i])+(X[i]*X[i]))), (-(X[i]/((R[i]*R[i])+(X[i]*X[i])))+ (B[i]/2))));
}
if (rowIdx==toBus[i] && colIdx==toBus[i]) { //diagonal element
y[index] = cuCaddf(y[index], make_cuComplex((R[i]/((R[i]*R[i])+(X[i]*X[i]))), (-(X[i]/((R[i]*R[i])+(X[i]*X[i])))+ (B[i]/2))));
}
if (rowIdx==fromBus[i] && colIdx==toBus[i]) { //off-diagonal element
y[index] = make_cuComplex(-(R[i]/((R[i]*R[i])+(X[i]*X[i]))), X[i]/((R[i]*R[i])+(X[i]*X[i])));
}
if (rowIdx==toBus[i] && colIdx==fromBus[i]) { //off-diagonal element
y[index] = make_cuComplex(-(R[i]/((R[i]*R[i])+(X[i]*X[i]))), X[i]/((R[i]*R[i])+(X[i]*X[i])));
}
}
}
}
int main ()
{
int numLines = 32 ;
int numberOfBuses = 2383 ;
int* dev_fromBus, *dev_toBus;
float *dev_R, *dev_X, *dev_B;
cuComplex* dev_y ;
cudaMalloc((void**)&dev_fromBus, numLines*sizeof(int));
cudaMalloc((void**)&dev_toBus, numLines*sizeof(int));
cudaMalloc((void**)&dev_R, numLines*sizeof(float));
cudaMalloc((void**)&dev_X, numLines*sizeof(float));
cudaMalloc((void**)&dev_B, numLines*sizeof(float));
cudaMalloc((void**)&dev_y, numberOfBuses*numberOfBuses*sizeof(cuComplex));
dim3 dimBlock(16, 16); //number of threads
dim3 dimGrid((numberOfBuses+15)/16, (numberOfBuses+15)/16); //number of blocks
//launch kernel once data has been copied to GPU
createYBus<<<dimGrid, dimBlock>>>(dev_R, dev_X, dev_B, numberOfBuses, numLines, dev_fromBus, dev_toBus, dev_y);
cuComplex* y_bus = new cuComplex[numberOfBuses*numberOfBuses] ;
//copy results back to CPU
cudaError_t cudaStat6 = cudaMemcpy(y_bus, dev_y, numberOfBuses*numberOfBuses*sizeof(cuComplex), cudaMemcpyDeviceToHost);
if (cudaStat6 != cudaSuccess) {
printf ("failure : (%d) - %s\n", cudaStat6, ::cudaGetErrorString(cudaStat6)) ;
return 1;
}
return 0 ;
}
Your error seems to be somewhere else.
You want to run your code in NSIGHT debug mode with cuda mem check activated. If compiled with debug information, the tool should point out the location of your error.
EDIT: The problem appears to ne caused by WDDM TDR as discussed in comment.

Related

Overlap compute and transfer on Windows

i'm facing some issues when trying to overlap computation and transferts on Windows (using VS2015 and CUDA 10.1). The code doesn't overlap at all. But the exact same code on Linux as the expected behaviour.
Here is the views from NVVP :
Windows 10 NVVP screenshot :
Linux NVVP screenshot :
Please note the following points :
my host memory is PageLocked
i'm using two different streams
i'm using cudaMemcpyAsync method to transfert between host and device
if i run my code on Linux, everything is fine
i don't see anything in the documentation describing a different behaviour between there two systems.
So the question is the following :
Am i missing something ?
Does it exists a way to achieve overlapping on this configuration (Windows 10 + 1080Ti)?
you can find some code here to reproduce the issue :
#include "cuda_runtime.h"
constexpr int NB_ELEMS = 64*1024*1024;
constexpr int BUF_SIZE = NB_ELEMS * sizeof(float);
constexpr int BLK_SIZE=1024;
using namespace std;
__global__
void dummy_operation(float* ptr1, float* ptr2)
{
const int idx = threadIdx.x + blockIdx.x * blockDim.x;
if(idx<NB_ELEMS)
{
float value = ptr1[idx];
for(int i=0; i<100; ++i)
{
value += 1.0f;
}
ptr2[idx] = value;
}
}
int main()
{
float *h_data1 = nullptr, *h_data2 = nullptr,
*h_data3 = nullptr, *h_data4 = nullptr;
cudaMallocHost(&h_data1, BUF_SIZE);
cudaMallocHost(&h_data2, BUF_SIZE);
cudaMallocHost(&h_data3, BUF_SIZE);
cudaMallocHost(&h_data4, BUF_SIZE);
float *d_data1 = nullptr, *d_data2 = nullptr,
*d_data3 = nullptr, *d_data4 = nullptr;
cudaMalloc(&d_data1, BUF_SIZE);
cudaMalloc(&d_data2, BUF_SIZE);
cudaMalloc(&d_data3, BUF_SIZE);
cudaMalloc(&d_data4, BUF_SIZE);
cudaStream_t st1, st2;
cudaStreamCreate(&st1);
cudaStreamCreate(&st2);
const dim3 threads(BLK_SIZE);
const dim3 blocks(NB_ELEMS / BLK_SIZE + 1);
for(int i=0; i<10; ++i)
{
float* tmp_dev_ptr = (i%2)==0? d_data1 : d_data3;
float* tmp_host_ptr = (i%2)==0? h_data1 : h_data3;
cudaStream_t tmp_st = (i%2)==0? st1 : st2;
cudaMemcpyAsync(tmp_dev_ptr, tmp_host_ptr, BUF_SIZE, cudaMemcpyDeviceToHost, tmp_st);
dummy_operation<<<blocks, threads, 0, tmp_st>>>(tmp_dev_ptr, d_data2);
//cudaMempcyAsync(d_data2, h_data2);
}
cudaStreamSynchronize(st1);
cudaStreamSynchronize(st2);
return 0;
}
As pointed by #talonmies, to overlap compute and transfers you need a graphic card in Tesla Compute Cluster mode.
I've checked this behaviour using an old Quadro P620.
[Edit] Overlapping between kernels and copy seems to be working since i applied the Windows 10 update 1909.
I'm not sure if the windows update has included an graphic driver update or not. But it's fine :)

Why does my CUDA kernel crash (unspecified launch failure) with a different dataset size?

I have a kernel to calculate different elements of a matrix, based on their position (diagonal or off-diagonal). The kernel works as expected when calculating matrices of sizes:
14 x 14 (I understand this is small and does not make proper use of the GPU resources but this was purely for testing purposes to ensure results were correct)
118 x 118, and
300 x 300
However, when I am trying to calculate a matrix of size 2383 x 2383, the kernel crashes. Specifically, the error "Unspecified launch failure" is thrown on the cudaMemcpy() line to return results from device to host. From research, I understand that this error usually arises in the case of an out of bounds memory access (e.g. in an array), however, what I don't get is that it works for the three previous cases but not for the 2383 x 2383 case. The kernel code is shown below:
__global__ void createYBus(float *R, float *X, float *B, int numberOfBuses, int numberOfBranches, int *fromBus, int *toBus, cuComplex *y)
{
int rowIdx = blockIdx.y*blockDim.y + threadIdx.y;
int colIdx = blockIdx.x*blockDim.x + threadIdx.x;
int index = rowIdx*numberOfBuses + colIdx;
if (rowIdx<numberOfBuses && colIdx<numberOfBuses)
{
for (int i=0; i<numberOfBranches; ++i)
{
if (rowIdx==fromBus[i] && colIdx==fromBus[i]) { //diagonal element
y[index] = cuCaddf(y[index], make_cuComplex((R[i]/((R[i]*R[i])+(X[i]*X[i]))), (-(X[i]/((R[i]*R[i])+(X[i]*X[i])))+ (B[i]/2))));
}
if (rowIdx==toBus[i] && colIdx==toBus[i]) { //diagonal element
y[index] = cuCaddf(y[index], make_cuComplex((R[i]/((R[i]*R[i])+(X[i]*X[i]))), (-(X[i]/((R[i]*R[i])+(X[i]*X[i])))+ (B[i]/2))));
}
if (rowIdx==fromBus[i] && colIdx==toBus[i]) { //off-diagonal element
y[index] = make_cuComplex(-(R[i]/((R[i]*R[i])+(X[i]*X[i]))), X[i]/((R[i]*R[i])+(X[i]*X[i])));
}
if (rowIdx==toBus[i] && colIdx==fromBus[i]) { //off-diagonal element
y[index] = make_cuComplex(-(R[i]/((R[i]*R[i])+(X[i]*X[i]))), X[i]/((R[i]*R[i])+(X[i]*X[i])));
}
}
}
}
Global memory allocations are done via calls to cudaMalloc(). The allocations made in the code are as follows:
cudaStat1 = cudaMalloc((void**)&dev_fromBus, numLines*sizeof(int));
cudaStat2 = cudaMalloc((void**)&dev_toBus, numLines*sizeof(int));
cudaStat3 = cudaMalloc((void**)&dev_R, numLines*sizeof(float));
cudaStat4 = cudaMalloc((void**)&dev_X, numLines*sizeof(float));
cudaStat5 = cudaMalloc((void**)&dev_B, numLines*sizeof(float));
cudaStat6 = cudaMalloc((void**)&dev_y, numberOfBuses*numberOfBuses*sizeof(cuComplex));
cudaStat7 = cudaMalloc((void**)&dev_Pd, numberOfBuses*sizeof(float));
cudaStat8 = cudaMalloc((void**)&dev_Qd, numberOfBuses*sizeof(float));
cudaStat9 = cudaMalloc((void**)&dev_Vmag, numberOfBuses*sizeof(float));
cudaStat10 = cudaMalloc((void**)&dev_theta, numberOfBuses*sizeof(float));
cudaStat11 = cudaMalloc((void**)&dev_Peq, numberOfBuses*sizeof(float));
cudaStat12 = cudaMalloc((void**)&dev_Qeq, numberOfBuses*sizeof(float));
cudaStat13 = cudaMalloc((void**)&dev_Peq1, numberOfBuses*sizeof(float));
cudaStat14 = cudaMalloc((void**)&dev_Qeq1, numberOfBuses*sizeof(float));
...
...
cudaStat15 = cudaMalloc((void**)&dev_powerMismatch, jacSize*sizeof(float));
cudaStat16 = cudaMalloc((void**)&dev_jacobian, jacSize*jacSize*sizeof(float));
cudaStat17 = cudaMalloc((void**)&dev_stateVector, jacSize*sizeof(float));
cudaStat18 = cudaMalloc((void**)&dev_PQindex, jacSize*sizeof(int));
where cudaStatN are of type cudaError_t to catch errors. The last four allocations were done later on in the code and are for another kernel. However these allocations were done before the kernel in question was called.
The launch parameters are as follows:
dim3 dimBlock(16, 16); //number of threads
dim3 dimGrid((numberOfBuses+15)/16, (numberOfBuses+15)/16); //number of blocks
//launch kernel once data has been copied to GPU
createYBus<<<dimGrid, dimBlock>>>(dev_R, dev_X, dev_B, numberOfBuses, numLines, dev_fromBus, dev_toBus, dev_y);
//copy results back to CPU
cudaStat6 = cudaMemcpy(y_bus, dev_y, numberOfBuses*numberOfBuses*sizeof(cuComplex), cudaMemcpyDeviceToHost);
if (cudaStat6 != cudaSuccess) {
cout<<"Device memcpy failed"<<endl;
cout<<cudaGetErrorString(cudaStat6)<<endl;
return 1;
}
I removed the timing code just to show the block and grid dimensions and error checking technique used.
I also have a host (C++ code) version of this function and I'm passing the data to both functions and then comparing results, firstly, to ensure the kernel produces correct results, and secondly in terms of execution time to compare performance. I have double checked the data for the 2383 x 2383 case (it's being read in from a text file and copied to global memory) and I'm not finding any anomalies in array accesses/indexing.
I'm using Visual Studio 2010, so I tried using Nsight to find the error (I'm not too well-versed with Nsight). The summary report overview states: "There was 1 runtime API call error reported. (Please see the CUDA Runtime API Calls report for further information). In the list of runtime API calls, cudaMemcpy returns error 4 - not sure if the Thread ID (5012) is of any significance in the table - this number varies with every run. CUDA memcheck tool (in the command line) returns the following:
Thank you for using this program
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaMemcpy.
========= Saved host backtrace up to driver entry point at error
=========
========= ERROR SUMMARY: 1 error
I know my kernel isn't the most efficient as there are many global memory accesses. Why is the kernel crashing for this larger matrix? Is there an out of bounds array access that I'm missing? Any assistance would be greatly appreciated.
Solved the problem. Turns out the WDDM TDR (timeout detecion recovery) was enabled and the delay was set to 2 seconds. This means that if the kernel execution time exceeds 2s, the driver will crash and recover. This is applicable to graphics and rendering (for general purpose uses of the GPU). In this case however, the TDR must either me disabled or the delay increased. By increasing the delay to 10s, the crash error "unspecified launch failure" ceased to appear and kernel execution continued as before.
The TDR delay (as well as enabling/disabling) can be done through Nsight options in the Nsight Monitor or through the Registry (HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers) - DWORDS Tdrdelay and Tdrlevel.
I tried to reproduce your code with the following complete example. The code compiles, runs with no error.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include "cuComplex.h"
__global__ void createYBus(float *R, float *X, float *B, int numberOfBuses, int numberOfBranches, int *fromBus, int *toBus, cuComplex *y)
{
int rowIdx = blockIdx.y*blockDim.y + threadIdx.y;
int colIdx = blockIdx.x*blockDim.x + threadIdx.x;
int index = rowIdx*numberOfBuses + colIdx;
if (rowIdx<numberOfBuses && colIdx<numberOfBuses)
{
for (int i=0; i<numberOfBranches; ++i)
{
if (rowIdx==fromBus[i] && colIdx==fromBus[i]) { //diagonal element
y[index] = cuCaddf(y[index], make_cuComplex((R[i]/((R[i]*R[i])+(X[i]*X[i]))), (-(X[i]/((R[i]*R[i])+(X[i]*X[i])))+ (B[i]/2))));
}
if (rowIdx==toBus[i] && colIdx==toBus[i]) { //diagonal element
y[index] = cuCaddf(y[index], make_cuComplex((R[i]/((R[i]*R[i])+(X[i]*X[i]))), (-(X[i]/((R[i]*R[i])+(X[i]*X[i])))+ (B[i]/2))));
}
if (rowIdx==fromBus[i] && colIdx==toBus[i]) { //off-diagonal element
y[index] = make_cuComplex(-(R[i]/((R[i]*R[i])+(X[i]*X[i]))), X[i]/((R[i]*R[i])+(X[i]*X[i])));
}
if (rowIdx==toBus[i] && colIdx==fromBus[i]) { //off-diagonal element
y[index] = make_cuComplex(-(R[i]/((R[i]*R[i])+(X[i]*X[i]))), X[i]/((R[i]*R[i])+(X[i]*X[i])));
}
}
}
}
int main ()
{
int numLines = 32 ;
int numberOfBuses = 2383 ;
int* dev_fromBus, *dev_toBus;
float *dev_R, *dev_X, *dev_B;
cuComplex* dev_y ;
cudaMalloc((void**)&dev_fromBus, numLines*sizeof(int));
cudaMalloc((void**)&dev_toBus, numLines*sizeof(int));
cudaMalloc((void**)&dev_R, numLines*sizeof(float));
cudaMalloc((void**)&dev_X, numLines*sizeof(float));
cudaMalloc((void**)&dev_B, numLines*sizeof(float));
cudaMalloc((void**)&dev_y, numberOfBuses*numberOfBuses*sizeof(cuComplex));
dim3 dimBlock(16, 16); //number of threads
dim3 dimGrid((numberOfBuses+15)/16, (numberOfBuses+15)/16); //number of blocks
//launch kernel once data has been copied to GPU
createYBus<<<dimGrid, dimBlock>>>(dev_R, dev_X, dev_B, numberOfBuses, numLines, dev_fromBus, dev_toBus, dev_y);
cuComplex* y_bus = new cuComplex[numberOfBuses*numberOfBuses] ;
//copy results back to CPU
cudaError_t cudaStat6 = cudaMemcpy(y_bus, dev_y, numberOfBuses*numberOfBuses*sizeof(cuComplex), cudaMemcpyDeviceToHost);
if (cudaStat6 != cudaSuccess) {
printf ("failure : (%d) - %s\n", cudaStat6, ::cudaGetErrorString(cudaStat6)) ;
return 1;
}
return 0 ;
}
Your error seems to be somewhere else.
You want to run your code in NSIGHT debug mode with cuda mem check activated. If compiled with debug information, the tool should point out the location of your error.
EDIT: The problem appears to ne caused by WDDM TDR as discussed in comment.

CUDA : 2D grid launch error [duplicate]

My monte carlo pi calculation CUDA program is causing my nvidia driver to crash when I exceed around 500 trials and 256 full blocks. It seems to be happening in the monteCarlo kernel function.Any help is appreciated.
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>
#define NUM_THREAD 256
#define NUM_BLOCK 256
///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
// Function to sum an array
__global__ void reduce0(float *g_odata) {
extern __shared__ int sdata[];
// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = g_odata[i];
__syncthreads();
// do reduction in shared mem
for (unsigned int s=1; s < blockDim.x; s *= 2) { // step = s x 2
if (tid % (2*s) == 0) { // only threadIDs divisible by the step participate
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
// write result for this block to global mem
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
__global__ void monteCarlo(float *g_odata, int trials, curandState *states){
// unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
unsigned int incircle, k;
float x, y, z;
incircle = 0;
curand_init(1234, i, 0, &states[i]);
for(k = 0; k < trials; k++){
x = curand_uniform(&states[i]);
y = curand_uniform(&states[i]);
z =(x*x + y*y);
if (z <= 1.0f) incircle++;
}
__syncthreads();
g_odata[i] = incircle;
}
///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
int main() {
float* solution = (float*)calloc(100, sizeof(float));
float *sumDev, *sumHost, total;
const char *error;
int trials;
curandState *devStates;
trials = 500;
total = trials*NUM_THREAD*NUM_BLOCK;
dim3 dimGrid(NUM_BLOCK,1,1); // Grid dimensions
dim3 dimBlock(NUM_THREAD,1,1); // Block dimensions
size_t size = NUM_BLOCK*NUM_THREAD*sizeof(float); //Array memory size
sumHost = (float*)calloc(NUM_BLOCK*NUM_THREAD, sizeof(float));
cudaMalloc((void **) &sumDev, size); // Allocate array on device
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
cudaMalloc((void **) &devStates, (NUM_THREAD*NUM_BLOCK)*sizeof(curandState));
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
// Do calculation on device by calling CUDA kernel
monteCarlo <<<dimGrid, dimBlock>>> (sumDev, trials, devStates);
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
// call reduction function to sum
reduce0 <<<dimGrid, dimBlock, (NUM_THREAD*sizeof(float))>>> (sumDev);
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
dim3 dimGrid1(1,1,1);
dim3 dimBlock1(256,1,1);
reduce0 <<<dimGrid1, dimBlock1, (NUM_THREAD*sizeof(float))>>> (sumDev);
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
// Retrieve result from device and store it in host array
cudaMemcpy(sumHost, sumDev, sizeof(float), cudaMemcpyDeviceToHost);
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
*solution = 4*(sumHost[0]/total);
printf("%.*f\n", 1000, *solution);
free (solution);
free(sumHost);
cudaFree(sumDev);
cudaFree(devStates);
//*solution = NULL;
return 0;
}
If smaller numbers of trials work correctly, and if you are running on MS Windows without the NVIDIA Tesla Compute Cluster (TCC) driver and/or the GPU you are using is attached to a display, then you are probably exceeding the operating system's "watchdog" timeout. If the kernel occupies the display device (or any GPU on Windows without TCC) for too long, the OS will kill the kernel so that the system does not become non-interactive.
The solution is to run on a non-display-attached GPU and if you are on Windows, use the TCC driver. Otherwise, you will need to reduce the number of trials in your kernel and run the kernel multiple times to compute the number of trials you need.
EDIT: According to the CUDA 4.0 curand docs(page 15, "Performance Notes"), you can improve performance by copying the state for a generator to local storage inside your kernel, then storing the state back (if you need it again) when you are finished:
curandState state = states[i];
for(k = 0; k < trials; k++){
x = curand_uniform(&state);
y = curand_uniform(&state);
z =(x*x + y*y);
if (z <= 1.0f) incircle++;
}
Next, it mentions that setup is expensive, and suggests that you move curand_init into a separate kernel. This may help keep the cost of your MC kernel down so you don't run up against the watchdog.
I recommend reading that section of the docs, there are several useful guidelines.
For those of you having a geforce GPU which does not support TCC driver there is another solution based on:
http://msdn.microsoft.com/en-us/library/windows/hardware/ff569918(v=vs.85).aspx
start regedit,
navigate to HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers
create new DWORD key called TdrLevel, set value to 0,
restart PC.
Now your long-running kernels should not be terminated. This answer is based on:
Modifying registry to increase GPU timeout, windows 7
I just thought it might be useful to provide the solution here as well.

cuda display driver stppoed [duplicate]

My monte carlo pi calculation CUDA program is causing my nvidia driver to crash when I exceed around 500 trials and 256 full blocks. It seems to be happening in the monteCarlo kernel function.Any help is appreciated.
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <curand.h>
#include <curand_kernel.h>
#define NUM_THREAD 256
#define NUM_BLOCK 256
///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
// Function to sum an array
__global__ void reduce0(float *g_odata) {
extern __shared__ int sdata[];
// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = g_odata[i];
__syncthreads();
// do reduction in shared mem
for (unsigned int s=1; s < blockDim.x; s *= 2) { // step = s x 2
if (tid % (2*s) == 0) { // only threadIDs divisible by the step participate
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
// write result for this block to global mem
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
__global__ void monteCarlo(float *g_odata, int trials, curandState *states){
// unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
unsigned int incircle, k;
float x, y, z;
incircle = 0;
curand_init(1234, i, 0, &states[i]);
for(k = 0; k < trials; k++){
x = curand_uniform(&states[i]);
y = curand_uniform(&states[i]);
z =(x*x + y*y);
if (z <= 1.0f) incircle++;
}
__syncthreads();
g_odata[i] = incircle;
}
///////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////
int main() {
float* solution = (float*)calloc(100, sizeof(float));
float *sumDev, *sumHost, total;
const char *error;
int trials;
curandState *devStates;
trials = 500;
total = trials*NUM_THREAD*NUM_BLOCK;
dim3 dimGrid(NUM_BLOCK,1,1); // Grid dimensions
dim3 dimBlock(NUM_THREAD,1,1); // Block dimensions
size_t size = NUM_BLOCK*NUM_THREAD*sizeof(float); //Array memory size
sumHost = (float*)calloc(NUM_BLOCK*NUM_THREAD, sizeof(float));
cudaMalloc((void **) &sumDev, size); // Allocate array on device
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
cudaMalloc((void **) &devStates, (NUM_THREAD*NUM_BLOCK)*sizeof(curandState));
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
// Do calculation on device by calling CUDA kernel
monteCarlo <<<dimGrid, dimBlock>>> (sumDev, trials, devStates);
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
// call reduction function to sum
reduce0 <<<dimGrid, dimBlock, (NUM_THREAD*sizeof(float))>>> (sumDev);
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
dim3 dimGrid1(1,1,1);
dim3 dimBlock1(256,1,1);
reduce0 <<<dimGrid1, dimBlock1, (NUM_THREAD*sizeof(float))>>> (sumDev);
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
// Retrieve result from device and store it in host array
cudaMemcpy(sumHost, sumDev, sizeof(float), cudaMemcpyDeviceToHost);
error = cudaGetErrorString(cudaGetLastError());
printf("%s\n", error);
*solution = 4*(sumHost[0]/total);
printf("%.*f\n", 1000, *solution);
free (solution);
free(sumHost);
cudaFree(sumDev);
cudaFree(devStates);
//*solution = NULL;
return 0;
}
If smaller numbers of trials work correctly, and if you are running on MS Windows without the NVIDIA Tesla Compute Cluster (TCC) driver and/or the GPU you are using is attached to a display, then you are probably exceeding the operating system's "watchdog" timeout. If the kernel occupies the display device (or any GPU on Windows without TCC) for too long, the OS will kill the kernel so that the system does not become non-interactive.
The solution is to run on a non-display-attached GPU and if you are on Windows, use the TCC driver. Otherwise, you will need to reduce the number of trials in your kernel and run the kernel multiple times to compute the number of trials you need.
EDIT: According to the CUDA 4.0 curand docs(page 15, "Performance Notes"), you can improve performance by copying the state for a generator to local storage inside your kernel, then storing the state back (if you need it again) when you are finished:
curandState state = states[i];
for(k = 0; k < trials; k++){
x = curand_uniform(&state);
y = curand_uniform(&state);
z =(x*x + y*y);
if (z <= 1.0f) incircle++;
}
Next, it mentions that setup is expensive, and suggests that you move curand_init into a separate kernel. This may help keep the cost of your MC kernel down so you don't run up against the watchdog.
I recommend reading that section of the docs, there are several useful guidelines.
For those of you having a geforce GPU which does not support TCC driver there is another solution based on:
http://msdn.microsoft.com/en-us/library/windows/hardware/ff569918(v=vs.85).aspx
start regedit,
navigate to HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers
create new DWORD key called TdrLevel, set value to 0,
restart PC.
Now your long-running kernels should not be terminated. This answer is based on:
Modifying registry to increase GPU timeout, windows 7
I just thought it might be useful to provide the solution here as well.

Cuda kernel producing the resultant vector as zero

Here is the kernel that I am launching for calculating some array in parallel.
__device__ bool mult(int colsize,int rowsize,int *Aj,int *Bi,int *val)
{
for(int j = 0; j < rowsize;j++)
{
for(int k = 0;k < colsize;k++)
{
if(Aj[j] == Bi[k])
{
return true;
}
}
}
return false;
}
__global__ void kernel(int *Aptr,int *Aj,int *Bptr,int *Bi,int rows,int cols,int *Cjc)
{
int tid = threadIdx.x + blockIdx.x * blockDim.x;
int i;
if(tid < cols)
{
int beg = Bptr[tid];
int end = Bptr[tid+1];
for(i = 0;i < rows;i++)
{
int cbeg = Aptr[i];
int cend = Aptr[i+1];
if(mult(end - beg,cend - cbeg,Aj+cbeg,Bi+beg))
{
Cjc[tid+1] += 1;
//atomicAdd(Cjc+tid+1,1);
}
}
}
}
And here is how I decide the configuration of grid and blocks
int numBlocks,numThreads;
if(q % 32 == 0)
{
numBlocks = q/32;
numThreads = 32;
}
else
{
numBlocks = (q+31)/32;
numThreads = 32;
}
findkernel<<<numBlocks,numThreads>>>(devAptr,devAcol,devBjc,devBir,m,q,d_Cjc);
I am using GTX 480 with CC 2.0.
Now the problem that I am facing is that whenever q increases beyond 4096 the values in Cjc array are all produced as 0.
I know maximum number of blocks that I can use in X direction is 65535 and each block can have at most (1024,1024,64) threads. Then why does this kernel calculate the wrong output for Cjc array?
I seems like there are a couple of things wrong with the code you posted:
I guess findkernel is kernel in the CUDA code above?
kernel has 8 parameters, but you only use 7 parameters to call findkernel. This doesn't look right!
In kernel, you test if(tid < cols) - I guess this should be if(tid < count)??
Why does kernel expect count to be a pointer? I think you don't pass in an int pointer but a regular integer value to findkernel.
Why does __device__ bool mult get count/int *val if it is not used?
I guess #3 or #4 could be the source of your problem, but you should look at the other things as well.
OK so I finally figured out using cudaError_t that when I tried to cudaMemcpy the d_Cjc array from device to host, it throws following error.
CUDA error: the launch timed out and was terminated
It turns out that some of the calculations in findkernel are taking reasonably large amount of time which causes the display driver to terminate the program because of OS 'watchdog' time limit.
I believe I will have to shut down X server or ssh my gpu machine (from another machine) by removing its display.This will buy me some time to do the calculations that will not exceed the 'watchdog' limit of OS.