float2 matrix (as 1D array) and CUDA

I have to work with a float2 matrix as a 1D array. I wanted to check some things and I have written this code:
#include <stdio.h>
#include <stdlib.h>
#define index(x,y) x+y*N

__global__ void test(float2* matrix_CUDA, int N)
{
    int i, j;
    i = blockIdx.x*blockDim.x + threadIdx.x;
    j = blockIdx.y*blockDim.y + threadIdx.y;
    matrix_CUDA[index(i,j)].x = i;
    matrix_CUDA[index(i,j)].y = j;
}

int main()
{
    int N = 256;
    int i, j;
    //////////////////////////////////////////
    float2* matrix;
    matrix = (float2*)malloc(N*N*sizeof(float2));
    //////////////////////////////////////////
    float2* matrix_CUDA;
    cudaMalloc((void**)&matrix_CUDA, N*N*sizeof(float2));
    //////////////////////////////////////////
    dim3 block_dim(32,2,0);
    dim3 grid_dim(2,2,0);
    test <<< grid_dim,block_dim >>> (matrix_CUDA,N);
    //////////////////////////////////////////
    cudaMemcpy(matrix, matrix_CUDA, N*N*sizeof(float2), cudaMemcpyDeviceToHost);
    for(i=0; i<N; i++)
    {
        for(j=0; j<N; j++)
        {
            printf("%d %d, %f %f\n", i, j, matrix[index(i,j)].x, matrix[index(i,j)].y);
        }
    }
    return 0;
}
I was expecting an output like:
0 0, 0 0
0 1, 0 1
0 2, 0 2
0 3, 0 3
...
But what I actually find is:
0 0, -nan 7.265723657
0 1, -nan 152345
0 2, 25.2135235 -nan
0 3, 52354.324534 24.52354234523
...
That suggests I have some problem with the memory allocation (I suppose), but I can't find what is wrong with my code. Could someone help me?

Any time you are having trouble with a CUDA code, you should always use proper CUDA error checking and run your code with cuda-memcheck before asking for help.
Even if you don't understand the output, it will be useful to others trying to help you.
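For reference, "proper CUDA error checking" here means checking the status returned by every CUDA runtime call and checking for errors after each kernel launch. A minimal sketch (the macro name is purely illustrative):
#include <stdio.h>
#include <stdlib.h>
#define CHECK_CUDA(call) do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) { \
        fprintf(stderr, "CUDA error \"%s\" at %s:%d\n", \
                cudaGetErrorString(err), __FILE__, __LINE__); \
        exit(EXIT_FAILURE); \
    } } while (0)
// usage:
// CHECK_CUDA(cudaMalloc((void**)&matrix_CUDA, N*N*sizeof(float2)));
// test<<<grid_dim, block_dim>>>(matrix_CUDA, N);
// CHECK_CUDA(cudaGetLastError());        // reports launch/configuration errors
// CHECK_CUDA(cudaDeviceSynchronize());   // reports errors raised while the kernel runs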
If you had run this code with cuda-memcheck, you would have gotten (amongst all your other output!) some output like this:
$ cuda-memcheck ./t1273
========= CUDA-MEMCHECK
========= Program hit cudaErrorInvalidConfiguration (error 9) due to "invalid configuration argument" on CUDA API call to cudaLaunch.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/lib64/libcuda.so.1 [0x2eea03]
========= Host Frame:./t1273 [0x3616e]
========= Host Frame:./t1273 [0x2bfd]
========= Host Frame:./t1273 [0x299a]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21b15]
========= Host Frame:./t1273 [0x2a5d]
=========
========= ERROR SUMMARY: 1 error
$
This means something is wrong with the way you configured your kernel launch:
dim3 block_dim(32,2,0);
dim3 grid_dim(2,2,0);
test <<< grid_dim,block_dim >>> (matrix_CUDA,N);
^^^^^^^^^^^^^^^^^^
kernel config arguments
Specifically, you do not ever select a dimension of zero when creating a dim3 variable for kernel launch. The minimum dimension for any component is 1, not zero.
So use arguments like this:
dim3 block_dim(32,2,1);
dim3 grid_dim(2,2,1);
In addition, once you fix that, you still find that many of your outputs are not touched by your code. To fix that, you'll need to increase the size of your thread array to match the size of your data array. Since you have a 1-D array, it's not really clear to me why you are launching 2D threadblocks and 2D grids. Your data array should be completely "coverable" with a total of 65536 threads in a linear dimension, something like this:
dim3 block_dim(32,1,1);
dim3 grid_dim(2048,1,1);
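For illustration, if you also want to keep the original 2D indexing (so that .x holds i and .y holds j, as in your expected output), here is a sketch of a 2D configuration that covers all N*N elements, with a thread check added to the kernel for safety (it assumes N = 256; the linear configuration above is equally valid):
__global__ void test(float2* matrix_CUDA, int N)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;
    if (i < N && j < N) {               // thread check: out-of-range threads do nothing
        matrix_CUDA[index(i,j)].x = i;
        matrix_CUDA[index(i,j)].y = j;
    }
}

// launch configuration covering the full N x N index space
dim3 block_dim(32, 2, 1);
dim3 grid_dim((N + block_dim.x - 1)/block_dim.x, (N + block_dim.y - 1)/block_dim.y, 1);  // (8, 128, 1) for N = 256
test<<<grid_dim, block_dim>>>(matrix_CUDA, N);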

Related

Copy class data allocated on device back to host

In my code I want to allocate memory for a pointer data member of a class during kernel execution and write to it afterwards. Then I want to get this data on the host later. In my approach, however, I don't get the right data on the host (see below). Is my approach completely off or can you spot the erroneous part?
#include <cuda_runtime.h>
#include <stdio.h>

class OutputData {
public:
    int *data;
};

__global__ void init(OutputData *buffer)
{
    // allocate memory for data
    buffer->data = (int*) malloc(sizeof(int)*2);
    // write data
    buffer->data[0] = 1;
    buffer->data[1] = 2;
}

int main(int argc, char **argv)
{
    // malloc device memory
    OutputData *d_buffer;
    cudaMalloc(&d_buffer, sizeof(OutputData));

    // run kernel
    init<<<1,1>>>(d_buffer);
    cudaDeviceSynchronize();

    // malloc host memory
    OutputData *h_buffer = (OutputData*) malloc(sizeof(OutputData));

    // transfer data from device to host
    cudaMemcpy(h_buffer, d_buffer, sizeof(OutputData), cudaMemcpyDeviceToHost);
    int* h_data = (int*) malloc(sizeof(int)*2);
    cudaMemcpy(h_data, h_buffer->data, sizeof(int)*2, cudaMemcpyDeviceToHost);

    // Print the data
    printf("h_data[0] = %d, h_data[1] = %d\n", h_data[0], h_data[1]);

    // free memory
    cudaFree(h_buffer->data);
    free(h_buffer);
    cudaFree(d_buffer);
    free(h_data);

    return (0);
}
The output is
h_data[0] = 0, h_data[1] = 0
and not
h_data[0] = 1, h_data[1] = 2
as expected.
As per the documentation:
In addition, device malloc() memory cannot be used in any runtime or driver API calls (i.e. cudaMemcpy, cudaMemset, etc).
To confirm this, let's run your code with cuda-memcheck:
$ nvcc -std=c++11 -arch=sm_52 -o heapcopy heapcopy.cu
$ cuda-memcheck ./heapcopy
========= CUDA-MEMCHECK
h_data[0] = 36791296, h_data[1] = 0
========= Program hit cudaErrorInvalidValue (error 11) due to "invalid argument" on CUDA API call to cudaMemcpy.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x3451c3]
========= Host Frame:./heapcopy [0x3cb0a]
========= Host Frame:./heapcopy [0x31ac]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21f45]
========= Host Frame:./heapcopy [0x2fd9]
=========
========= Program hit cudaErrorInvalidDevicePointer (error 17) due to "invalid device pointer" on CUDA API call to cudaFree.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x3451c3]
========= Host Frame:./heapcopy [0x44f00]
========= Host Frame:./heapcopy [0x31dc]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21f45]
========= Host Frame:./heapcopy [0x2fd9]
=========
========= ERROR SUMMARY: 2 errors
This is why your code fails -- the address stored in h_buffer->data is not accessible to the host API. Note also that it cannot be freed from the host either.
You could do something like this, which uses a managed memory allocation as the host memory (so it is directly accessible within the kernel), and a device side cudaMemcpyAsync call:
#include <cuda_runtime.h>
#include <stdio.h>

class OutputData {
public:
    int *data;
};

__global__ void init(OutputData *buffer)
{
    // allocate memory for data
    buffer->data = (int*) malloc(sizeof(int)*2);
    // write data
    buffer->data[0] = 1;
    buffer->data[1] = 2;
}

__global__ void deepcopy(OutputData* dest, OutputData* source, size_t datasz)
{
    cudaMemcpyAsync(dest->data, source->data, datasz, cudaMemcpyDeviceToDevice);
}

int main(int argc, char **argv)
{
    // malloc device memory
    OutputData *d_buffer;
    cudaMalloc(&d_buffer, sizeof(OutputData));

    // run kernel
    init<<<1,1>>>(d_buffer);
    cudaDeviceSynchronize();

    // malloc host memory as managed memory
    //OutputData *h_buffer = (OutputData*) malloc(sizeof(OutputData));
    //int* h_data = (int*) malloc(sizeof(int)*2);
    size_t dsize = sizeof(int)*2;
    OutputData* h_buffer; cudaMallocManaged(&h_buffer, sizeof(OutputData));
    int* h_data; cudaMallocManaged(&h_data, dsize);
    h_buffer->data = h_data;

    // run kernel
    deepcopy<<<1,1>>>(h_buffer, d_buffer, dsize);
    cudaDeviceSynchronize();

    // Print the data
    printf("h_data[0] = %d, h_data[1] = %d\n", h_data[0], h_data[1]);

    // free memory
    cudaFree(h_data);
    cudaFree(h_buffer);
    cudaFree(d_buffer);
    return (0);
}
Which runs as expected (note there is technically a device heap memory leak here because a device side free call is never made):
$ nvcc -std=c++11 -arch=sm_52 -dc -o heapcopy.o heapcopy.cu
$ nvcc -std=c++11 -arch=sm_52 -o heapcopy heapcopy.o
$ cuda-memcheck ./heapcopy
========= CUDA-MEMCHECK
h_data[0] = 1, h_data[1] = 2
========= ERROR SUMMARY: 0 errors
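To avoid that device-heap leak, the allocation has to be released from device code, since (as shown earlier) the host API cannot free it. A minimal sketch (the kernel name is illustrative):
__global__ void cleanup(OutputData *buffer)
{
    // memory obtained from in-kernel malloc() must be freed by in-kernel free()
    free(buffer->data);
    buffer->data = NULL;
}
// host side, after the copy has completed:
// cleanup<<<1,1>>>(d_buffer);
// cudaDeviceSynchronize();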
There are other variations (like building a complete mirror structure of the heap structure in global memory from the host and then running the copy kernel), but those make even less sense than this does.
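For completeness, a sketch of that mirror-structure variation (names are hypothetical): the host allocates an ordinary global-memory buffer, a small kernel copies the heap allocation into it, and the result is then fetched with a normal cudaMemcpy:
__global__ void gather(int *dest, const OutputData *source, size_t count)
{
    // device code may dereference the in-kernel malloc() pointer directly
    for (size_t i = 0; i < count; i++)
        dest[i] = source->data[i];
}
// host side:
// int *d_mirror;  cudaMalloc(&d_mirror, sizeof(int)*2);
// gather<<<1,1>>>(d_mirror, d_buffer, 2);
// int h_data[2];
// cudaMemcpy(h_data, d_mirror, sizeof(int)*2, cudaMemcpyDeviceToHost);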

Check failed: error == cudaSuccess (77 vs. 0) an illegal memory access was encountered [duplicate]

This question already has an answer here:
CUDA - invalid device function, how to know [architecture, code]?
(1 answer)
Closed 2 years ago.
I'm debugging some lengthy code which involves some CUDA operations.
I'm currently getting the above-mentioned error during a call to cudaMemcpy(...,...,cudaMemcpyHostToDevice), but I'm not sure it is specifically related to that.
Here is a code snippet:
int num_elements = 8294400; // --> I also tried it with "1" here which didn't work either!
float *checkArray = new float[num_elements];
float *checkArray_GPU;
CUDA_CHECK(cudaMalloc(&checkArray_GPU, num_elements * sizeof(float)));
CUDA_CHECK(cudaMemcpy(checkArray_GPU, checkArray, num_elements * sizeof(float), cudaMemcpyHostToDevice));
CUDA_CHECK(cudaMemcpy(checkArray, checkArray_GPU, num_elements * sizeof(float), cudaMemcpyDeviceToHost));
where CUDA_CHECK is simply a macro for printing any CUDA error (this was part of the existing code and works fine for all other cudaMemcpy or cudaMalloc calls, so it is not part of the problem). Strangely, this code snippet works fine when executed separately in a toy *.cu example.
So my assumption is that, due to previous CUDA operations in the program, some errors occurred that were never reported and now cause the failure in the code snippet above. Could that be?
Is there a way to check if there is some unreported error involving cuda?
My other guess is that it might come from the specific graphics card I'm using. I have an NVIDIA Titan X (Pascal), CUDA 8.0 and cuDNN v5.1. I also tried to compile my code using some special compiler flags like
-arch=sm_30 \
-gencode=arch=compute_20,code=sm_20 \
-gencode=arch=compute_30,code=sm_30 \
-gencode=arch=compute_50,code=sm_50 \
-gencode=arch=compute_52,code=sm_52 \
-gencode=arch=compute_52,code=compute_52 \
-gencode=arch=compute_60,code=sm_60 \
-gencode=arch=compute_61,code=sm_61 \
-gencode=arch=compute_62,code=sm_62 \
but it didn't help so far. Here is my current simplified Makefile for completeness:
NVCC = nvcc
CUDA_INC = -I/usr/local/cuda/include
CUDA_LIB = -L/usr/local/cuda/lib64
TARGET = myProgramm
OPTS = -std=c++11
$(TARGET).so: $(TARGET).o
	$(NVCC) $(OPTS) -shared $(TARGET).o $(CUDA_LIB) -o $(TARGET).so
$(TARGET).o: $(TARGET).cu headers/some_header.hpp
	$(NVCC) $(OPTS) $(CUDA_INC) -Xcompiler -fPIC -c $(TARGET).cu
Does anyone have an idea how I could get to the bottom of this?
Edit:
cuda-memcheck was a good idea. The error apparently happens earlier, during a call to Kernel_set_value:
========= Invalid __global__ write of size 4
========= at 0x00000298 in void Kernel_set_value<float>(unsigned long, unsigned long, float*, float)
========= by thread (480,0,0) in block (30,0,0)
========= Address 0x0005cd00 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x209035]
[...]
========= Host Frame:/media/.../myProgramm.so (_ZN5boost6python6detail6invokeIiPFvRKSsENS0_15arg_from_pythonIS4_EEEEP7_objectNS1_11invoke_tag_ILb1ELb0EEERKT_RT0_RT1_ + 0x2d) [0x3e5eb]
[...]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaMemcpy.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x2f4e33]
========= Host Frame:/media/.../myProgramm.so [0x7489f]
F0703 16:23:54.840698 26207 myProgramm.cu:411] Check failed: error == cudaSuccess (4 vs. 0) unspecified launch failure
[...]
========= Host Frame:python (Py_Main + 0xb5e) [0x66d92]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21f45]
========= Host Frame:python [0x177c2e]
=========
*** Check failure stack trace: ***
========= Error: process didn't terminate successfully
========= Internal error (20)
========= No CUDA-MEMCHECK results found
But the function Kernel_set_value also works fine in a toy example. Is there anything special to consider when using Kernel_set_value? Here are its source code and its helper functions.
#define CUDA_NUM_THREADS 512
#define MAX_NUM_BLOCKS 2880

inline int CUDA_GET_BLOCKS(const size_t N) {
    return min(MAX_NUM_BLOCKS, int((N + size_t(CUDA_NUM_THREADS) - 1) / CUDA_NUM_THREADS));
}

inline size_t CUDA_GET_LOOPS(const size_t N) {
    size_t total_threads = CUDA_GET_BLOCKS(N)*CUDA_NUM_THREADS;
    return (N + total_threads - 1) / total_threads;
}

template <typename Dtype>
__global__ void Kernel_set_value(size_t CUDA_NUM_LOOPS, size_t N, Dtype* GPUdst, Dtype value){
    const size_t idxBase = size_t(CUDA_NUM_LOOPS) * (size_t(CUDA_NUM_THREADS) * size_t(blockIdx.x) + size_t(threadIdx.x));
    if (idxBase >= N) return;
    for (size_t idx = idxBase; idx < min(N, idxBase+CUDA_NUM_LOOPS); ++idx){
        GPUdst[idx] = value;
    }
}
So the final solution was to compile the code without any -gencode=arch=compute_XX,code=sm_XX-style flags. It took me forever to find this out. The actual error codes were very misleading: error == cudaSuccess (77 vs. 0) an illegal memory access, (4 vs. 0) unspecified launch failure, or (8 vs. 0) invalid device function.
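As for the question of detecting unreported errors: kernel launches are asynchronous and do not return a status themselves, so a failure from an earlier kernel (for example "invalid device function" caused by an architecture mismatch) often only surfaces at the next runtime API call, such as the cudaMemcpy above. A sketch of how each launch can be checked, reusing the CUDA_CHECK macro from the question (d_ptr and n are hypothetical):
Kernel_set_value<float><<<CUDA_GET_BLOCKS(n), CUDA_NUM_THREADS>>>(CUDA_GET_LOOPS(n), n, d_ptr, 0.f);
CUDA_CHECK(cudaGetLastError());        // reports launch/configuration errors immediately
CUDA_CHECK(cudaDeviceSynchronize());   // reports faults raised while the kernel runs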

Error with 'cuda-memcheck' in cuda 8.0

It is strange that when I do not add cuda-memcheck before ./main, the program runs without any warning or error message. However, when I add it, I get error messages like the following.
========= Invalid __global__ write of size 8
========= at 0x00000120 in initCurand(curandStateXORWOW*, unsigned long)
========= by thread (9,0,0) in block (3,0,0)
========= Address 0x5005413b0 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2c5) [0x204115]
========= Host Frame:./main [0x18e11]
========= Host Frame:./main [0x369b3]
========= Host Frame:./main [0x3403]
========= Host Frame:./main [0x308c]
========= Host Frame:./main [0x30b7]
========= Host Frame:./main [0x2ebb]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20830]
Here are my functions. A brief introduction to the code: I try to generate random numbers and save them to the device variable weights, then use this vector to sample from discrete numbers.
#include <iostream>
#include <curand.h>
#include <curand_kernel.h>
#include <time.h>
using namespace std;

#define num 100
__device__ float weights[num];

// kernel to seed the random generator
__global__ void initCurand(curandState *state, unsigned long seed){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    curand_init(seed, idx, 0, &state[idx]);
}

__device__ void sampling(float *weight, float max_weight, int *index, curandState *state){
    int j;
    float u;
    do{
        j = (int)(curand_uniform(state) * (num + 0.999999));
        u = curand_uniform(state); // sample from uniform distribution
    } while(u > weight[j]/max_weight);
    *index = j;
}

__global__ void test(int *dev_sample, curandState *state){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // generate random numbers from uniform distribution and save them to weights
    weights[idx] = curand_uniform(&state[idx]);
    // run the sampling function, in which weights is an input for each thread
    sampling(weights, 1, dev_sample+idx, &state[idx]);
}

int main(){
    // define the seed of the random generator
    curandState *devState;
    cudaMalloc((void**)&devState, num*sizeof(curandState));
    int *h_sample;
    h_sample = (int*) malloc(num*sizeof(int));
    int *d_sample;
    cudaMalloc((void**)&d_sample, num*sizeof(float));
    initCurand<<<(int)num/32 + 1, 32>>>(devState, 1);
    test<<<(int)num/32 + 1, 32>>>(d_sample, devState);
    cudaMemcpy(h_sample, d_sample, num*sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < num; ++i)
    {
        cout << *(h_sample + i) << endl;
    }
    // free memory
    cudaFree(devState);
    free(h_sample);
    cudaFree(d_sample);
    return 0;
}
I have just started to learn CUDA, so if the way I access global memory is incorrect, please help me with that. Thanks.
This is launching "extra" threads:
initCurand<<<(int)num/32 + 1, 32>>>(devState, 1);
num is 100, so the above config will launch 4 blocks of 32 threads each, i.e. 128 threads. But you are only allocating space for 100 curandState here:
cudaMalloc((void**)&devState, num*sizeof(curandState));
So your initCurand kernel will have some threads (idx = 100-127) that attempt to initialize curandState objects you haven't allocated. As a result, when you run cuda-memcheck, which does fairly rigorous out-of-bounds checking, an error is reported.
One possible solution would be to modify your initCurand kernel as follows:
__global__ void initCurand(curandState *state, unsigned long seed, int num){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < num)
        curand_init(seed, idx, 0, &state[idx]);
}
This will prevent any out-of-bounds threads from doing anything. Note that you will need to modify the kernel call to pass num to it. Also, it appears to me you have a similar problem in your test kernel; you may want to do something similar to fix it there. This is a common construct in CUDA kernels; I call it a "thread check". You can find other questions on the cuda tag discussing this same concept.
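For illustration, a sketch of the same guard applied to the test kernel (the count is passed as a parameter named n here, because num is a #define in the posted code):
__global__ void test(int *dev_sample, curandState *state, int n){
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n){   // thread check: threads 100-127 do nothing
        weights[idx] = curand_uniform(&state[idx]);
        sampling(weights, 1, dev_sample + idx, &state[idx]);
    }
}
// launched as before, with the extra argument:
// test<<<(int)num/32 + 1, 32>>>(d_sample, devState, num);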

Initialize constant global array CUDA C

I have a problem! I need to initialize a constant global array in CUDA C. To initialize the array I need to use a for loop! I need to do this because I have to use this array in some kernels, and my professor told me to define it as a constant visible only on the device.
How can I do this??
I want to do something like this:
#include <stdio.h>
#include <math.h>
#define N 8

__constant__ double H[N*N];

__global__ void prodotto(double *v, double *w){
    int k = threadIdx.x + blockDim.x*blockIdx.x;
    w[k] = 0;
    for(int i=0; i<N; i++) w[k] = w[k] + H[k*N+i]*v[i];
}

int main(){
    double v[8]={1, 1, 1, 1, 1, 1, 1, 1};
    double *dev_v, *dev_w, *w;
    double *host_H;
    host_H = (double*)malloc((N*N)*sizeof(double));
    cudaMalloc((void**)&dev_v, sizeof(double));
    cudaMalloc((void**)&dev_w, sizeof(double));
    for(int k=0; k<N; k++){
        host_H[2*N*k+2*k] = 1/1.414;
        host_H[2*N*k+2*k+1] = 1/1.414;
        host_H[(2*k+1)*N+2*k] = 1/1.414;
        host_H[(2*k+1)+2*k+1] = -1/1.414;
    }
    cudaMemcpyToSymbol(H, host_H, (N*N)*sizeof(double));
    cudaMemcpy(dev_v, v, N*sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_w, w, N*sizeof(double), cudaMemcpyHostToDevice);
    prodotto<<<1,N>>>(dev_v, dev_w);
    cudaMemcpy(v, dev_v, N*sizeof(double), cudaMemcpyDeviceToHost);
    cudaMemcpy(w, dev_w, N*sizeof(double), cudaMemcpyDeviceToHost);
    for(int i=0; i<N; i++) printf("\n%f %f", v[i], w[i]);
    return 0;
}
But the output is an array of zeros... I want the output array to be filled with the product of the matrix H (stored here as a 1D array) and the array v.
Thanks !!!!!
Something like this should work:
#define DSIZE 32
__constant__ int mydata[DSIZE];

int main(){
    ...
    int *h_mydata;
    h_mydata = new int[DSIZE];
    for (int i = 0; i < DSIZE; i++)
        h_mydata[i] = ....; // initialize however you wish
    cudaMemcpyToSymbol(mydata, h_mydata, DSIZE*sizeof(int));
    ...
}
Not difficult. You can then use the __constant__ data directly in a kernel:
__global__ void mykernel(...){
    ...
    int myval = mydata[threadIdx.x];
    ...
}
You can read about __constant__ variables in the programming guide. __constant__ variables are read-only from the perspective of device code (kernel code). But from the host, they can be read from or written to using the cudaMemcpyToSymbol/cudaMemcpyFromSymbol API.
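For example, a sketch of reading the data back on the host with cudaMemcpyFromSymbol (using the names from the snippet above; the copy direction defaults to device-to-host):
int *h_check = new int[DSIZE];
cudaMemcpyFromSymbol(h_check, mydata, DSIZE*sizeof(int));
// h_check now mirrors the __constant__ array mydata
delete [] h_check;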
EDIT: Based on the code you've now posted, there were at least 2 errors:
Your allocation sizes for dev_v and dev_w were not correct.
You had no host allocation for w.
The following code seems to work correctly for me with those 2 fixes:
$ cat t579.cu
#include <stdio.h>
#include <math.h>
#define N 8

__constant__ double H[N*N];

__global__ void prodotto(double *v, double *w){
    int k = threadIdx.x + blockDim.x*blockIdx.x;
    w[k] = 0;
    for(int i=0; i<N; i++) w[k] = w[k] + H[k*N+i]*v[i];
}

int main(){
    double v[N]={1, 1, 1, 1, 1, 1, 1, 1};
    double *dev_v, *dev_w, *w;
    double *host_H;
    host_H = (double*)malloc((N*N)*sizeof(double));
    w = (double*)malloc((N)*sizeof(double));
    cudaMalloc((void**)&dev_v, N*sizeof(double));
    cudaMalloc((void**)&dev_w, N*sizeof(double));
    for(int k=0; k<N; k++){
        host_H[2*N*k+2*k] = 1/1.414;
        host_H[2*N*k+2*k+1] = 1/1.414;
        host_H[(2*k+1)*N+2*k] = 1/1.414;
        host_H[(2*k+1)+2*k+1] = -1/1.414;
    }
    cudaMemcpyToSymbol(H, host_H, (N*N)*sizeof(double));
    cudaMemcpy(dev_v, v, N*sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_w, w, N*sizeof(double), cudaMemcpyHostToDevice);
    prodotto<<<1,N>>>(dev_v, dev_w);
    cudaMemcpy(v, dev_v, N*sizeof(double), cudaMemcpyDeviceToHost);
    cudaMemcpy(w, dev_w, N*sizeof(double), cudaMemcpyDeviceToHost);
    for(int i=0; i<N; i++) printf("\n%f %f", v[i], w[i]);
    printf("\n");
    return 0;
}
$ nvcc -arch=sm_20 -o t579 t579.cu
$ cuda-memcheck ./t579
========= CUDA-MEMCHECK
1.000000 0.000000
1.000000 -0.707214
1.000000 -0.707214
1.000000 -1.414427
1.000000 1.414427
1.000000 0.707214
1.000000 1.414427
1.000000 0.707214
========= ERROR SUMMARY: 0 errors
$
A few notes:
Any time you're having trouble with a CUDA code, it's good practice to use proper CUDA error checking.
You can run your code with cuda-memcheck (just as I have above) to get a quick read of whether any CUDA errors are encountered.
I've not verified the numerical results or worked through the math. If it's not what you wanted, I assume you can sort it out.
I've not made any changes to your code other than what seemed sensible to me to fix the obvious errors and make the results presentable for educational purposes. Certainly there can be discussions about preferred allocation methods, printf vs. cout, and what have you. I'm focused primarily on CUDA topics in this answer.

CUDA racecheck, shared memory array and cudaDeviceSynchronize()

I recently discovered the racecheck tool of cuda-memcheck, available in CUDA 5.0 (cuda-memcheck --tool racecheck, see the NVIDIA doc). This tool can detect race conditions with shared memory in a CUDA kernel.
In debug mode, this tool does not detect anything, which is apparently normal. However, in release mode (-O3), I get errors depending on the parameters of the problem.
Here is an error example (initialization of shared memory on line 22, assignment on line 119):
========= ERROR: Potential WAW hazard detected at shared 0x0 in block (35, 0, 0) :
========= Write Thread (32, 0, 0) at 0x00000890 in ....h:119:void kernel_test3(Data*)
========= Write Thread (0, 0, 0) at 0x00000048 in ....h:22:void kernel_test3(Data*)
========= Current Value : 13, Incoming Value : 0
The first thing that surprised me is the thread ids. When I first encountered the error, each block contained 32 threads (ids 0 to 31). So why is there a problem with the thread id 32? I even added an extra check on threadIdx.x, but this changed nothing.
I use shared memory as a temporary buffer, and each thread deals with its own parameters of a multidimensional array, e.g. __shared__ float arr[SIZE_1][SIZE_2][NB_THREADS_PER_BLOCK]. I do not really understand how there could be any race conditions, since each thread deals with its own part of shared memory.
Reducing the grid size from 64 blocks to 32 blocks seemed to solve the issue (with 32 threads per block). I do not understand why.
In order to understand what was happening, I tested with some simpler kernels.
Let me show you an example of a kernel that creates that kind of error. Basically, this kernel uses SIZE_X*SIZE_Y*NTHREADS*sizeof(float) bytes of shared memory, and I can use 48 KB of shared memory per SM.
test.cu
template <unsigned int NTHREADS>
__global__ void kernel_test()
{
    const int SIZE_X = 4;
    const int SIZE_Y = 4;
    __shared__ float tmp[SIZE_X][SIZE_Y][NTHREADS];
    for (unsigned int i = 0; i < SIZE_X; i++)
        for (unsigned int j = 0; j < SIZE_Y; j++)
            tmp[i][j][threadIdx.x] = threadIdx.x;
}

int main()
{
    const unsigned int NTHREADS = 32;
    //kernel_test<NTHREADS><<<32, NTHREADS>>>(); // ---> works fine
    kernel_test<NTHREADS><<<64, NTHREADS>>>();
    cudaDeviceSynchronize(); // ---> gives racecheck errors if NBLOCKS > 32
}
Compilation:
nvcc test.cu --ptxas-options=-v -o test
If we run the kernel:
cuda-memcheck --tool racecheck test
kernel_test<32><<<32, 32>>>(); : 32 blocks, 32 threads => does not lead to any apparent racecheck error.
kernel_test<32><<<64, 32>>>(); : 64 blocks, 32 threads => leads to WAW hazards (threadIdx.x = 32?!) and errors.
========= ERROR: Potential WAW hazard detected at shared 0x6 in block (57, 0, 0) :
========= Write Thread (0, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
========= Write Thread (1, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
========= Current Value : 0, Incoming Value : 128
========= INFO:(Identical data being written) Potential WAW hazard detected at shared 0x0 in block (47, 0, 0) :
========= Write Thread (32, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
========= Write Thread (0, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
========= Current Value : 0, Incoming Value : 0
So what am I missing here? Am I doing something wrong with shared memory? (I am still a beginner with this)
** UPDATE **
The problem seems to be coming from cudaDeviceSynchronize() when NBLOCKS > 32. Why is this happening?
For starters, cudaDeviceSynchronize() isn't the cause; your kernel is. But the kernel launch is asynchronous, so the error is only caught at your call to cudaDeviceSynchronize().
As for the kernel, your shared memory is of size SIZE_X*SIZE_Y*NTHREADS (which in the example translates to 512 elements per block). In your nested loops you index into it using [i*blockDim.x*SIZE_Y + j*blockDim.x + threadIdx.x] -- this is where your problem is.
To be more specific, your i and j values will range from [0, 4), your threadIdx.x from [0, 32), and your SIZE_{X | Y} values are 4.
When blockDim.x is 64, your maximum index used in the loop will be 991 (from 3*64*4 + 3*64 + 31). When your blockDim.x is 32, your maximum index will be 511.
Based on your code, you should get errors whenever your NBLOCKS exceeds your NTHREADS.
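For reference, tmp[i][j][threadIdx.x] with dimensions [SIZE_X][SIZE_Y][NTHREADS] corresponds to exactly that flat index when NTHREADS equals blockDim.x; a sketch of the same writes inside kernel_test with the array flattened by hand:
__shared__ float tmp[SIZE_X * SIZE_Y * NTHREADS];
for (unsigned int i = 0; i < SIZE_X; i++)
    for (unsigned int j = 0; j < SIZE_Y; j++)
        tmp[i * SIZE_Y * NTHREADS + j * NTHREADS + threadIdx.x] = threadIdx.x;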
NOTE: I originally posted this to https://devtalk.nvidia.com/default/topic/527292/cuda-programming-and-performance/cuda-racecheck-shared-memory-array-and-cudadevicesynchronize-/
This was apparently a bug in NVIDIA drivers for Linux. The bug disappeared after the 313.18 release.