I'm facing some issues when trying to overlap computation and transfers on Windows (using VS2015 and CUDA 10.1). The code doesn't overlap at all, but the exact same code on Linux has the expected behaviour.
Here are the views from NVVP:
Windows 10 NVVP screenshot:
Linux NVVP screenshot:
Please note the following points:
my host memory is page-locked
I'm using two different streams
I'm using the cudaMemcpyAsync method to transfer between host and device
if I run my code on Linux, everything is fine
I don't see anything in the documentation describing a different behaviour between these two systems.
So the question is the following:
Am I missing something?
Is there a way to achieve overlapping on this configuration (Windows 10 + 1080 Ti)?
You can find some code below to reproduce the issue:
#include "cuda_runtime.h"
constexpr int NB_ELEMS = 64*1024*1024;
constexpr int BUF_SIZE = NB_ELEMS * sizeof(float);
constexpr int BLK_SIZE=1024;
using namespace std;
__global__
void dummy_operation(float* ptr1, float* ptr2)
{
const int idx = threadIdx.x + blockIdx.x * blockDim.x;
if(idx<NB_ELEMS)
{
float value = ptr1[idx];
for(int i=0; i<100; ++i)
{
value += 1.0f;
}
ptr2[idx] = value;
}
}
int main()
{
float *h_data1 = nullptr, *h_data2 = nullptr,
*h_data3 = nullptr, *h_data4 = nullptr;
cudaMallocHost(&h_data1, BUF_SIZE);
cudaMallocHost(&h_data2, BUF_SIZE);
cudaMallocHost(&h_data3, BUF_SIZE);
cudaMallocHost(&h_data4, BUF_SIZE);
float *d_data1 = nullptr, *d_data2 = nullptr,
*d_data3 = nullptr, *d_data4 = nullptr;
cudaMalloc(&d_data1, BUF_SIZE);
cudaMalloc(&d_data2, BUF_SIZE);
cudaMalloc(&d_data3, BUF_SIZE);
cudaMalloc(&d_data4, BUF_SIZE);
cudaStream_t st1, st2;
cudaStreamCreate(&st1);
cudaStreamCreate(&st2);
const dim3 threads(BLK_SIZE);
const dim3 blocks(NB_ELEMS / BLK_SIZE + 1);
for(int i=0; i<10; ++i)
{
float* tmp_dev_ptr = (i%2)==0? d_data1 : d_data3;
float* tmp_host_ptr = (i%2)==0? h_data1 : h_data3;
cudaStream_t tmp_st = (i%2)==0? st1 : st2;
cudaMemcpyAsync(tmp_dev_ptr, tmp_host_ptr, BUF_SIZE, cudaMemcpyDeviceToHost, tmp_st);
dummy_operation<<<blocks, threads, 0, tmp_st>>>(tmp_dev_ptr, d_data2);
//cudaMempcyAsync(d_data2, h_data2);
}
cudaStreamSynchronize(st1);
cudaStreamSynchronize(st2);
return 0;
}
As pointed out by talonmies, to overlap compute and transfers on Windows you need a graphics card running in Tesla Compute Cluster (TCC) mode.
I've checked this behaviour using an old Quadro P620.
[Edit] Overlapping between kernels and copies seems to be working since I applied the Windows 10 update 1909.
I'm not sure whether the Windows update included a graphics driver update or not, but it's fine :)
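For reference, here is a small sketch (mine, not part of the original post) that queries the two device properties relevant here: asyncEngineCount tells you whether the device can overlap copies with kernels at all, and tccDriver tells you whether the Windows driver is running in TCC mode.
#include <cstdio>
#include "cuda_runtime.h"

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // asyncEngineCount > 0 : the device has copy engines and can overlap
    //                        cudaMemcpyAsync with kernel execution.
    // tccDriver == 1       : the Windows driver is in TCC mode (no WDDM batching).
    printf("Device: %s\n", prop.name);
    printf("Copy engines (asyncEngineCount): %d\n", prop.asyncEngineCount);
    printf("TCC driver: %s\n", prop.tccDriver ? "yes" : "no (WDDM)");
    return 0;
}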
Related
I have a kernel to calculate different elements of a matrix, based on their position (diagonal or off-diagonal). The kernel works as expected when calculating matrices of sizes:
14 x 14 (I understand this is small and does not make proper use of the GPU resources but this was purely for testing purposes to ensure results were correct)
118 x 118, and
300 x 300
However, when I am trying to calculate a matrix of size 2383 x 2383, the kernel crashes. Specifically, the error "Unspecified launch failure" is thrown on the cudaMemcpy() line to return results from device to host. From research, I understand that this error usually arises in the case of an out of bounds memory access (e.g. in an array), however, what I don't get is that it works for the three previous cases but not for the 2383 x 2383 case. The kernel code is shown below:
__global__ void createYBus(float *R, float *X, float *B, int numberOfBuses, int numberOfBranches, int *fromBus, int *toBus, cuComplex *y)
{
    int rowIdx = blockIdx.y*blockDim.y + threadIdx.y;
    int colIdx = blockIdx.x*blockDim.x + threadIdx.x;
    int index = rowIdx*numberOfBuses + colIdx;
    if (rowIdx<numberOfBuses && colIdx<numberOfBuses)
    {
        for (int i=0; i<numberOfBranches; ++i)
        {
            if (rowIdx==fromBus[i] && colIdx==fromBus[i]) { //diagonal element
                y[index] = cuCaddf(y[index], make_cuComplex((R[i]/((R[i]*R[i])+(X[i]*X[i]))), (-(X[i]/((R[i]*R[i])+(X[i]*X[i])))+ (B[i]/2))));
            }
            if (rowIdx==toBus[i] && colIdx==toBus[i]) { //diagonal element
                y[index] = cuCaddf(y[index], make_cuComplex((R[i]/((R[i]*R[i])+(X[i]*X[i]))), (-(X[i]/((R[i]*R[i])+(X[i]*X[i])))+ (B[i]/2))));
            }
            if (rowIdx==fromBus[i] && colIdx==toBus[i]) { //off-diagonal element
                y[index] = make_cuComplex(-(R[i]/((R[i]*R[i])+(X[i]*X[i]))), X[i]/((R[i]*R[i])+(X[i]*X[i])));
            }
            if (rowIdx==toBus[i] && colIdx==fromBus[i]) { //off-diagonal element
                y[index] = make_cuComplex(-(R[i]/((R[i]*R[i])+(X[i]*X[i]))), X[i]/((R[i]*R[i])+(X[i]*X[i])));
            }
        }
    }
}
Global memory allocations are done via calls to cudaMalloc(). The allocations made in the code are as follows:
cudaStat1 = cudaMalloc((void**)&dev_fromBus, numLines*sizeof(int));
cudaStat2 = cudaMalloc((void**)&dev_toBus, numLines*sizeof(int));
cudaStat3 = cudaMalloc((void**)&dev_R, numLines*sizeof(float));
cudaStat4 = cudaMalloc((void**)&dev_X, numLines*sizeof(float));
cudaStat5 = cudaMalloc((void**)&dev_B, numLines*sizeof(float));
cudaStat6 = cudaMalloc((void**)&dev_y, numberOfBuses*numberOfBuses*sizeof(cuComplex));
cudaStat7 = cudaMalloc((void**)&dev_Pd, numberOfBuses*sizeof(float));
cudaStat8 = cudaMalloc((void**)&dev_Qd, numberOfBuses*sizeof(float));
cudaStat9 = cudaMalloc((void**)&dev_Vmag, numberOfBuses*sizeof(float));
cudaStat10 = cudaMalloc((void**)&dev_theta, numberOfBuses*sizeof(float));
cudaStat11 = cudaMalloc((void**)&dev_Peq, numberOfBuses*sizeof(float));
cudaStat12 = cudaMalloc((void**)&dev_Qeq, numberOfBuses*sizeof(float));
cudaStat13 = cudaMalloc((void**)&dev_Peq1, numberOfBuses*sizeof(float));
cudaStat14 = cudaMalloc((void**)&dev_Qeq1, numberOfBuses*sizeof(float));
...
...
cudaStat15 = cudaMalloc((void**)&dev_powerMismatch, jacSize*sizeof(float));
cudaStat16 = cudaMalloc((void**)&dev_jacobian, jacSize*jacSize*sizeof(float));
cudaStat17 = cudaMalloc((void**)&dev_stateVector, jacSize*sizeof(float));
cudaStat18 = cudaMalloc((void**)&dev_PQindex, jacSize*sizeof(int));
where the cudaStatN variables are of type cudaError_t to catch errors. The last four allocations were done later on in the code and are for another kernel; however, these allocations were made before the kernel in question was called.
The launch parameters are as follows:
dim3 dimBlock(16, 16); //number of threads
dim3 dimGrid((numberOfBuses+15)/16, (numberOfBuses+15)/16); //number of blocks
//launch kernel once data has been copied to GPU
createYBus<<<dimGrid, dimBlock>>>(dev_R, dev_X, dev_B, numberOfBuses, numLines, dev_fromBus, dev_toBus, dev_y);
//copy results back to CPU
cudaStat6 = cudaMemcpy(y_bus, dev_y, numberOfBuses*numberOfBuses*sizeof(cuComplex), cudaMemcpyDeviceToHost);
if (cudaStat6 != cudaSuccess) {
cout<<"Device memcpy failed"<<endl;
cout<<cudaGetErrorString(cudaStat6)<<endl;
return 1;
}
I removed the timing code just to show the block and grid dimensions and error checking technique used.
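As an aside (this macro is my addition, not from the original question, and CUDA_CHECK is a hypothetical name), wrapping every runtime call and checking the launch separately makes it easier to see that the error actually originates in the kernel rather than in the later cudaMemcpy:
#include <cstdio>
#include <cstdlib>
#include "cuda_runtime.h"

// Hypothetical helper: print file/line and abort if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err = (call);                                           \
        if (err != cudaSuccess) {                                           \
            fprintf(stderr, "CUDA error %d (%s) at %s:%d\n",                \
                    (int)err, cudaGetErrorString(err), __FILE__, __LINE__); \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

// Usage sketch around the launch in question:
//   createYBus<<<dimGrid, dimBlock>>>(...);
//   CUDA_CHECK(cudaGetLastError());       // catches launch/configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised by the kernel itself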
I also have a host (C++ code) version of this function and I'm passing the data to both functions and then comparing results, firstly, to ensure the kernel produces correct results, and secondly in terms of execution time to compare performance. I have double checked the data for the 2383 x 2383 case (it's being read in from a text file and copied to global memory) and I'm not finding any anomalies in array accesses/indexing.
I'm using Visual Studio 2010, so I tried using Nsight to find the error (I'm not too well-versed with Nsight). The summary report overview states: "There was 1 runtime API call error reported. (Please see the CUDA Runtime API Calls report for further information.)" In the list of runtime API calls, cudaMemcpy returns error 4 - I'm not sure if the Thread ID (5012) is of any significance in the table - this number varies with every run. The CUDA memcheck tool (on the command line) returns the following:
Thank you for using this program
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaMemcpy.
========= Saved host backtrace up to driver entry point at error
=========
========= ERROR SUMMARY: 1 error
I know my kernel isn't the most efficient as there are many global memory accesses. Why is the kernel crashing for this larger matrix? Is there an out of bounds array access that I'm missing? Any assistance would be greatly appreciated.
Solved the problem. It turns out the WDDM TDR (Timeout Detection and Recovery) was enabled and the delay was set to 2 seconds. This means that if the kernel execution time exceeds 2 s, the driver resets and recovers. TDR is intended for graphics and rendering workloads; for general-purpose uses of the GPU it must either be disabled or the delay increased. After increasing the delay to 10 s, the "unspecified launch failure" error stopped appearing and kernel execution continued as before.
The TDR delay (as well as enabling/disabling TDR) can be changed through the Nsight options in the Nsight Monitor or through the registry (HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers) - the DWORD values TdrDelay and TdrLevel.
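An alternative that avoids touching the TDR settings (my sketch, not part of the original answer) is to split the work into several shorter launches so that no single launch runs anywhere near the watchdog timeout. This assumes a hypothetical kernel variant createYBusStripe that takes an extra rowOffset parameter, adds it to rowIdx, and keeps the same bounds check; the other variables are those from the question:
const int rowsPerLaunch = 256;   // tuning parameter: rows processed per launch
dim3 dimBlock(16, 16);
dim3 dimGrid((numberOfBuses + 15) / 16, (rowsPerLaunch + 15) / 16);

for (int rowOffset = 0; rowOffset < numberOfBuses; rowOffset += rowsPerLaunch) {
    // Each stripe covers rows [rowOffset, rowOffset + rowsPerLaunch).
    createYBusStripe<<<dimGrid, dimBlock>>>(dev_R, dev_X, dev_B,
                                            numberOfBuses, numLines,
                                            dev_fromBus, dev_toBus, dev_y,
                                            rowOffset);
    cudaDeviceSynchronize();     // let each stripe finish before the next launch
}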
I tried to reproduce your code with the following complete example. The code compiles and runs with no error.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include "cuComplex.h"
__global__ void createYBus(float *R, float *X, float *B, int numberOfBuses, int numberOfBranches, int *fromBus, int *toBus, cuComplex *y)
{
int rowIdx = blockIdx.y*blockDim.y + threadIdx.y;
int colIdx = blockIdx.x*blockDim.x + threadIdx.x;
int index = rowIdx*numberOfBuses + colIdx;
if (rowIdx<numberOfBuses && colIdx<numberOfBuses)
{
for (int i=0; i<numberOfBranches; ++i)
{
if (rowIdx==fromBus[i] && colIdx==fromBus[i]) { //diagonal element
y[index] = cuCaddf(y[index], make_cuComplex((R[i]/((R[i]*R[i])+(X[i]*X[i]))), (-(X[i]/((R[i]*R[i])+(X[i]*X[i])))+ (B[i]/2))));
}
if (rowIdx==toBus[i] && colIdx==toBus[i]) { //diagonal element
y[index] = cuCaddf(y[index], make_cuComplex((R[i]/((R[i]*R[i])+(X[i]*X[i]))), (-(X[i]/((R[i]*R[i])+(X[i]*X[i])))+ (B[i]/2))));
}
if (rowIdx==fromBus[i] && colIdx==toBus[i]) { //off-diagonal element
y[index] = make_cuComplex(-(R[i]/((R[i]*R[i])+(X[i]*X[i]))), X[i]/((R[i]*R[i])+(X[i]*X[i])));
}
if (rowIdx==toBus[i] && colIdx==fromBus[i]) { //off-diagonal element
y[index] = make_cuComplex(-(R[i]/((R[i]*R[i])+(X[i]*X[i]))), X[i]/((R[i]*R[i])+(X[i]*X[i])));
}
}
}
}
int main ()
{
int numLines = 32 ;
int numberOfBuses = 2383 ;
int* dev_fromBus, *dev_toBus;
float *dev_R, *dev_X, *dev_B;
cuComplex* dev_y ;
cudaMalloc((void**)&dev_fromBus, numLines*sizeof(int));
cudaMalloc((void**)&dev_toBus, numLines*sizeof(int));
cudaMalloc((void**)&dev_R, numLines*sizeof(float));
cudaMalloc((void**)&dev_X, numLines*sizeof(float));
cudaMalloc((void**)&dev_B, numLines*sizeof(float));
cudaMalloc((void**)&dev_y, numberOfBuses*numberOfBuses*sizeof(cuComplex));
dim3 dimBlock(16, 16); //number of threads
dim3 dimGrid((numberOfBuses+15)/16, (numberOfBuses+15)/16); //number of blocks
//launch kernel once data has been copied to GPU
createYBus<<<dimGrid, dimBlock>>>(dev_R, dev_X, dev_B, numberOfBuses, numLines, dev_fromBus, dev_toBus, dev_y);
cuComplex* y_bus = new cuComplex[numberOfBuses*numberOfBuses] ;
//copy results back to CPU
cudaError_t cudaStat6 = cudaMemcpy(y_bus, dev_y, numberOfBuses*numberOfBuses*sizeof(cuComplex), cudaMemcpyDeviceToHost);
if (cudaStat6 != cudaSuccess) {
printf ("failure : (%d) - %s\n", cudaStat6, ::cudaGetErrorString(cudaStat6)) ;
return 1;
}
return 0 ;
}
Your error seems to be somewhere else.
You want to run your code in Nsight debug mode with CUDA memcheck activated. If the code is compiled with debug information, the tool should point out the location of your error.
EDIT: The problem appears to be caused by the WDDM TDR, as discussed in the comments.
I have the following problem (keep in mind that I am fairly new to programming with CUDA).
I have a class called vec3f that is just like the float3 data type but with overloaded operators and other vector functions. These functions are prefixed with __device__ __host__. Then, in my kernel I do a nested for loop over block_x and block_y indices and do something like this:
//set up shared memory block
extern __shared__ vec3f share[];
vec3f *sh_pos = share;
vec3f *sh_velocity = &sh_pos[blockDim.x*blockDim.y];
sh_pos[blockDim.x * threadIdx.x + threadIdx.y] = oldParticles[index].position();
sh_velocity[blockDim.x * threadIdx.x + threadIdx.y] = oldParticles[index].velocity();
__syncthreads();
In the above code, oldParticles is a pointer to a class called particle that is being passed to the kernel. oldParticles is actually the underlying pointer of a thrust::device_vector (I'm not sure if this has something to do with it). Everything compiles okay, but when I run I get the error
libc++abi.dylib: terminate called throwing an exception
Abort trap: 6
Thanks for the replies. I think the error had to do with me not allocating room for the arguments being passed to my kernel. Doing the following in my host code fixed this error,
particle* particle_ptrs[2];
particle_ptrs[0] = thrust::raw_pointer_cast(&d_old_particles[0]);
particle_ptrs[1] = thrust::raw_pointer_cast(&d_new_particles[0]);
CUDA_SAFE_CALL( cudaMalloc( (void**)&particle_ptrs[0], max_particles * sizeof(particle) ) );
CUDA_SAFE_CALL( cudaMalloc( (void**)&particle_ptrs[1], max_particles * sizeof(particle) ) );
The kernel call is then,
force_kernel<<< grid,block,sharedMemSize >>>(particle_ptrs[0],particle_ptrs[1],time_step);
The issue that I am having now seems to be that I can't get data copied back to the host from the device. I think this has to do with me not being familiar with Thrust.
I'm doing a series of copies as follows:
//make a host vector assume this is initialized
thrust::host_vector<particle> h_particles;
thrust::device_vector<particle> d_old_particles, d_new_particles;
d_old_particles = h_particles;
//launch kernel as shown above
//with thrust vectors having been casted into their underlying pointers
//particle_ptrs[1] gets modified, and so shouldn't d_new_particles?
//copy back
h_particles = d_new_particles;
So I guess my question is: can I modify a thrust device vector in a kernel (in this case particle_ptrs[0]), save the modification to another thrust device vector in the kernel (in this case particle_ptrs[1]), and then, once I exit from the kernel, copy that to a host vector?
I still can't get this to work. I made a shorter example where I am having the same problem,
#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include "vec3f.h"

const int BLOCK_SIZE = 8;
const int max_particles = 64;
const float dt = 0.01;

using namespace std;

//particle class
class particle {
public:
    particle() :
        _velocity(vec3f(0,0,0)), _position(vec3f(0,0,0)), _density(0.0) {
    };
    particle(const vec3f& pos, const vec3f& vel) :
        _position(pos), _velocity(vel), _density(0.0) {
    };
    vec3f _velocity;
    vec3f _position;
    float _density;
};

//forward declaration of kernel func
__global__ void kernel_func(particle* old_parts, particle* new_parts, float dt);

//global thrust vectors
thrust::host_vector<particle> h_parts;
thrust::device_vector<particle> old_parts, new_parts;
particle* particle_ptrs[2];

int main() {
    //load host vector
    for (int i = 0; i < max_particles; i++) {
        h_parts.push_back(particle(vec3f(0.5,0.5,0.5), vec3f(10,10,10)));
    }

    particle_ptrs[0] = thrust::raw_pointer_cast(&old_parts[0]);
    particle_ptrs[1] = thrust::raw_pointer_cast(&new_parts[0]);
    cudaMalloc( (void**)&particle_ptrs[0], max_particles * sizeof(particle) );
    cudaMalloc( (void**)&particle_ptrs[1], max_particles * sizeof(particle) );

    //copy host particles to old device particles...
    old_parts = h_parts;

    //kernel block and grid dimensions
    dim3 block(BLOCK_SIZE,BLOCK_SIZE,1);
    dim3 grid(int(sqrt(float(max_particles) / (float(block.x*block.y)))), int(sqrt(float(max_particles) / (float(block.x*block.y)))), 1);
    kernel_func<<<block,grid>>>(particle_ptrs[0],particle_ptrs[1],dt);

    //copy new device particles back to host particles
    h_parts = new_parts;
    for (int i = 0; i < max_particles; i++) {
        particle temp1 = h_parts[i];
        cout << temp1._position << endl;
    }

    //delete thrust device vectors
    old_parts.clear();
    old_parts.shrink_to_fit();
    new_parts.clear();
    new_parts.shrink_to_fit();
    return 0;
}

//kernel function
__global__ void kernel_func(particle* old_parts, particle* new_parts, float dt) {
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    //get array position for 2d grid...
    unsigned int arr_pos = y*blockDim.x*gridDim.x + x;

    new_parts[arr_pos]._velocity = old_parts[arr_pos]._velocity * 10.0 * dt;
    new_parts[arr_pos]._position = old_parts[arr_pos]._position * 10.0 * dt;
    new_parts[arr_pos]._density = old_parts[arr_pos]._density * 10.0 * dt;
}
So the host vector has an initial position of (0.5,0.5,0.5) for all 64 particles. Then the kernel attempts to multiply that by 10 to give (5,5,5) as the position for all particles. But I don't see this when I "cout" the data. It is still just (0.5,0.5,0.5). Is there a problem with how I am allocating memory? Is there a problem with these lines:
//copy new device particles back to host particles
h_parts = new_parts;
What could be the issue? Thank you.
There are various problems with the code you have posted.
you have your block and grid variables reversed in your kernel invocation. grid comes first.
you should be doing cuda error checking on your kernel and runtime API calls.
your method of allocating storage using cudaMalloc on a pointer which has been raw-cast from an empty device vector is not sensible. The vector container has no knowledge that you did this "under the hood." Instead, you can directly allocate storage for the device vector when you instantiate it, like:
thrust::device_vector<particle> old_parts(max_particles), new_parts(max_particles);
You say you're expecting 5,5,5, but your kernel is multiplying by 10 and then by dt which is 0.01, so I believe the correct output is 0.05, 0.05, 0.05
Your grid computation (int(sqrt...)), for an arbitrary max_particles either is not guaranteed to produce enough blocks (if casting a float to int truncates or rounds down) or will produce extra blocks (if it rounds up). The round down case is bad. We should handle that by using a ceil function or another grid computation method. The round up case (which is what ceil will do) is OK, but we need to handle the fact that the grid may launch extra blocks/threads. We do that with a thread check in the kernel. There were other problems with the grid computation as well. We want to take the square root of max_particles, then divide it by the block dimension in a particular direction, to get the grid dimension in that direction.
Here's some code that I've modified with these changes in mind; it seems to produce the correct output (0.05, 0.05, 0.05). Note that I had to make some other changes because I don't have your "vec3f.h" header file handy, so I used float3 instead.
#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <vector_functions.h>

const int BLOCK_SIZE = 8;
const int max_particles = 64;
const float dt = 0.01;

using namespace std;

//particle class
class particle {
public:
    particle() :
        _velocity(make_float3(0,0,0)), _position(make_float3(0,0,0)), _density(0.0)
    {
    };
    particle(const float3& pos, const float3& vel) :
        _position(pos), _velocity(vel), _density(0.0)
    {
    };
    float3 _velocity;
    float3 _position;
    float _density;
};

//forward declaration of kernel func
__global__ void kernel_func(particle* old_parts, particle* new_parts, float dt);

int main() {
    //global thrust vectors
    thrust::host_vector<particle> h_parts;
    particle* particle_ptrs[2];

    //load host vector
    for (int i = 0; i < max_particles; i++) {
        h_parts.push_back(particle(make_float3(0.5,0.5,0.5), make_float3(10,10,10)));
    }

    //copy host particles to old device particles...
    thrust::device_vector<particle> old_parts = h_parts;
    thrust::device_vector<particle> new_parts(max_particles);

    particle_ptrs[0] = thrust::raw_pointer_cast(&old_parts[0]);
    particle_ptrs[1] = thrust::raw_pointer_cast(&new_parts[0]);

    //kernel block and grid dimensions
    dim3 block(BLOCK_SIZE,BLOCK_SIZE,1);
    dim3 grid((int)ceil(sqrt(float(max_particles)) / (float(block.x))), (int)ceil(sqrt(float(max_particles)) / (float(block.y))), 1);
    cout << "grid x: " << grid.x << " grid y: " << grid.y << endl;

    kernel_func<<<grid,block>>>(particle_ptrs[0],particle_ptrs[1],dt);

    //copy new device particles back to host particles
    cudaDeviceSynchronize();
    h_parts = new_parts;

    for (int i = 0; i < max_particles; i++) {
        particle temp1 = h_parts[i];
        cout << temp1._position.x << "," << temp1._position.y << "," << temp1._position.z << endl;
    }

    //delete thrust device vectors
    old_parts.clear();
    old_parts.shrink_to_fit();
    new_parts.clear();
    new_parts.shrink_to_fit();
    return 0;
}

//kernel function
__global__ void kernel_func(particle* old_parts, particle* new_parts, float dt) {
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    //get array position for 2d grid...
    unsigned int arr_pos = y*blockDim.x*gridDim.x + x;

    if (arr_pos < max_particles) {
        new_parts[arr_pos]._velocity.x = old_parts[arr_pos]._velocity.x * 10.0 * dt;
        new_parts[arr_pos]._velocity.y = old_parts[arr_pos]._velocity.y * 10.0 * dt;
        new_parts[arr_pos]._velocity.z = old_parts[arr_pos]._velocity.z * 10.0 * dt;
        new_parts[arr_pos]._position.x = old_parts[arr_pos]._position.x * 10.0 * dt;
        new_parts[arr_pos]._position.y = old_parts[arr_pos]._position.y * 10.0 * dt;
        new_parts[arr_pos]._position.z = old_parts[arr_pos]._position.z * 10.0 * dt;
        new_parts[arr_pos]._density = old_parts[arr_pos]._density * 10.0 * dt;
    }
}
I have a CUDA kernel that needs to use an atomic function on volatile shared integer memory. However, when I try to declare the shared memory as volatile and use it in an atomic function, I get an error message.
Below is some minimalist code that reproduces the error. Please note that the following kernel does nothing and horribly abuses why you would ever want to declare shared memory as volatile (or even use shared memory at all). But it does reproduce the error.
The code uses atomic functions on shared memory, so, to run it, you probably need to compile for compute capability 1.2 or higher (in Visual Studio 2010, right-click on your project and go to "Properties -> Configuration Properties -> CUDA C/C++ -> Device" and enter "compute_12,sm_12" in the "Code Generation" line). The code should otherwise compile as is.
#include <cstdlib>
#include <cuda_runtime.h>

static int const X_THRDS_PER_BLK = 32;
static int const Y_THRDS_PER_BLK = 8;

__global__ void KernelWithSharedMemoryAndAtomicFunction(int * d_array, int numTotX, int numTotY)
{
    __shared__ int s_blk[Y_THRDS_PER_BLK][X_THRDS_PER_BLK]; // compiles
    //volatile __shared__ int s_blk[Y_THRDS_PER_BLK][X_THRDS_PER_BLK]; // will not compile

    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int mx = blockIdx.x*blockDim.x + threadIdx.x;
    int my = blockIdx.y*blockDim.y + threadIdx.y;
    int mi = my*numTotX + mx;

    if (mx < numTotX && my < numTotY)
    {
        s_blk[ty][tx] = d_array[mi];
        __syncthreads();
        atomicMin(&s_blk[ty][tx], 4); // will compile with volatile shared memory only if this line is commented out
        __syncthreads();
        d_array[mi] = s_blk[ty][tx];
    }
}

int main(void)
{
    // Declare and initialize some array on host
    int const NUM_TOT_X = 4*X_THRDS_PER_BLK;
    int const NUM_TOT_Y = 6*Y_THRDS_PER_BLK;
    int * h_array = (int *)malloc(NUM_TOT_X*NUM_TOT_Y*sizeof(int));
    for (int i = 0; i < NUM_TOT_X*NUM_TOT_Y; ++i) h_array[i] = i;

    // Copy array to device
    int * d_array;
    cudaMalloc((void **)&d_array, NUM_TOT_X*NUM_TOT_Y*sizeof(int));
    cudaMemcpy(d_array, h_array, NUM_TOT_X*NUM_TOT_Y*sizeof(int), cudaMemcpyHostToDevice);

    // Declare block and thread variables
    dim3 thdsPerBlk;
    dim3 blks;
    thdsPerBlk.x = X_THRDS_PER_BLK;
    thdsPerBlk.y = Y_THRDS_PER_BLK;
    thdsPerBlk.z = 1;
    blks.x = (NUM_TOT_X + X_THRDS_PER_BLK - 1)/X_THRDS_PER_BLK;
    blks.y = (NUM_TOT_Y + Y_THRDS_PER_BLK - 1)/Y_THRDS_PER_BLK;
    blks.z = 1;

    // Run kernel
    KernelWithSharedMemoryAndAtomicFunction<<<blks, thdsPerBlk>>>(d_array, NUM_TOT_X, NUM_TOT_Y);

    // Cleanup
    free(h_array);
    cudaFree(d_array);

    return 0;
}
Anyway, if you comment out the "s_blk" declaration towards the top of the kernel and uncomment the commented-out declaration immediately following it, then you should get the following error:
error : no instance of overloaded function "atomicMin" matches the argument list
I do not understand why declaring the shared memory as volatile would affect its type, as (I think) this error message is indicating, nor why it cannot be used with atomic operations.
Can anyone please provide any insight?
Thanks,
Aaron
Just replace
atomicMin(&s_blk[ty][tx], 4);
by
atomicMin((int *)&s_blk[ty][tx], 4);
It typecasts &s_blk[ty][tx], stripping the volatile qualifier so the pointer matches the int* argument that atomicMin(...) expects.
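For completeness, here is how the relevant part of the kernel reads with the cast applied (my sketch, not from the original answer):
volatile __shared__ int s_blk[Y_THRDS_PER_BLK][X_THRDS_PER_BLK]; // now compiles

// ...

s_blk[ty][tx] = d_array[mi];
__syncthreads();
// Cast away the volatile qualifier for the call; the atomic operation itself
// always performs a real memory access, so dropping it here does not change behaviour.
atomicMin((int *)&s_blk[ty][tx], 4);
__syncthreads();
d_array[mi] = s_blk[ty][tx];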
I have a small piece of code which runs perfectly on NVIDIA's old architecture (a Tesla T10 processor) but not on Fermi (a Tesla M2090).
I learned that Fermi behaves slightly differently, due to which unsafe code might work correctly on the old architectures while on Fermi it catches the bug.
But I don't know how to resolve it.
Here is my code:
__global__ void exec(int *arr_ptr, int size, int *result) {
    int tx = threadIdx.x;
    int ty = threadIdx.y;
    *result = arr_ptr[-2];
}

void run(int *arr_dev, int size, int *result) {
    cudaStream_t stream = 0;
    int *arr_ptr = arr_dev + 5;
    dim3 threads(1,1,1);
    dim3 grid(1,1);
    exec<<<grid, threads, 0, stream>>>(arr_ptr, size, result);
}
Since I am accessing arr_ptr[-2], Fermi throws CUDA_EXCEPTION_10, Device Illegal Address. But it is not - the address is legal.
Can anyone help me with this?
My driver code is
int main(){
    int *arr;
    int *arr_dev = NULL;
    int result = 1;

    arr = (int*)malloc(10*sizeof(int));
    for(int i = 0; i < 10; i++)
        arr[i] = i;

    if(arr_dev == NULL)
    {
        cudaMalloc((void**)&arr_dev, 10);
        cudaMemcpy(arr_dev, arr, 10*sizeof(int), cudaMemcpyHostToDevice);
    }

    run(arr_dev, 10, &result);
    printf("%d \n", result);
    return 0;
}
Fermi cards have much better memory protection on the device and will detect out-of-bounds conditions which appear to "work" on older cards. Use cuda-memcheck (or the memcheck mode in cuda-gdb) to get a better handle on what is going wrong.
EDIT:
This is the culprit:
cudaMalloc((void**)&arr_dev, 10);
which should be
cudaMalloc((void**)&arr_dev, 10*sizeof(int));
This will result in this code
int *arr_ptr = arr_dev + 5;
passing a pointer to the device which is out of bounds - the original call allocated only 10 bytes (room for just two ints), so arr_dev + 5 already points past the end of the allocation.
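For illustration only (my sketch, not part of the original answer), the corrected allocation in the driver code would look like this; with the full 10-int buffer, arr_dev + 5 is in bounds and arr_ptr[-2] reads arr_dev[3]:
// Allocate room for all 10 ints, not 10 bytes.
cudaMalloc((void**)&arr_dev, 10*sizeof(int));
cudaMemcpy(arr_dev, arr, 10*sizeof(int), cudaMemcpyHostToDevice);

// arr_ptr[-2] now resolves to arr_dev[3], well inside the allocation.
int *arr_ptr = arr_dev + 5;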