Why is my rather trivial CUDA program erring with certain arguments? - cuda

I made a simple CUDA program for practice. It simply copies over data from one array to another:
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
from pycuda.compiler import SourceModule
# Global constants
N = 2**20 # size of array a
a = np.linspace(0, 1, N)
e = np.empty_like(a)
block_size_x = 512
# Instantiate block and grid sizes.
block_size = (block_size_x, 1, 1)
grid_size = (N / block_size_x, 1)
# Create the CUDA kernel, and run it.
mod = SourceModule("""
__global__ void D2x_kernel(double* a, double* e, int N) {
int tid = blockDim.x * blockIdx.x + threadIdx.x;
if (tid > 0 && tid < N - 1) {
e[tid] = a[tid];
}
}
""")
func = mod.get_function('D2x_kernel')
func(a, cuda.InOut(e), np.int32(N), block=block_size, grid=grid_size)
print str(e)
However, I get this error: pycuda._driver.LogicError: cuLaunchKernel failed: invalid value
When I get rid of the second argument double* e in my kernel function and invoke the kernel without the argument e, the error goes away. Why is that? What does this error mean?

Your a array does not exist in device memory, so I suspect that PyCUDA is ignoring (or otherwise handling) the first argument to your kernel invocation and only passing in e and N...so you get an error because the kernel was expecting three arguments and it has only received two. Removing double* e from your kernel definition might eliminate the error message you're getting, but your kernel still won't work properly.
A quick fix to this should be to wrap a in a cuda.In() call, which instructs PyCUDA to copy a to the device before launching the kernel. That is, your kernel launch line should be:
func(cuda.In(a), cuda.InOut(e), np.int32(N), block=block_size, grid=grid_size)
Edit: Also, do you realize that your kernel is not copying the first and last elements of a to e? Your if (tid > 0 && tid < N - 1) statement is preventing that. For the entire array, it should be if (tid < N).

Related

CUDA cudaMemcpyFromSymbol "invalid device symbol" error?

So I see a parent question about how to copy from host to the constant memory on GPU using cudaMemcpyToSymbol.
My question is how to do the reverse, copying from device constant memory to the host using cudaMemcpyFromSymbol.
In the following minimal reproducible example, I either got
1) invalid device symbol error using cudaMemcpyFromSymbol(const_d_a, b, size);, or
2) got segmentation fault if I use cudaMemcpyFromSymbol(&b, const_d_a, size, cudaMemcpyDeviceToHost).
I have consulted with the manual which suggests I code as in 1), and this SO question that suggests I code as in 2). Neither of them work here.
Could anyone kindly help suggesting a workaround with this? I must be understanding something improperly... Thanks!
Here is the code:
// a basic CUDA function to test working with device constant memory
#include <stdio.h>
#include <cuda.h>
const unsigned int N = 10; // size of vectors
__constant__ float const_d_a[N * sizeof(float)];
int main()
{
float * a, * b; // a and b are vectors. c is the result
a = (float *)calloc(N, sizeof(float));
b = (float *)calloc(N, sizeof(float));
/**************************** Exp 1: sequential ***************************/
int i;
int size = N * sizeof(float);
for (i = 0; i < N; i++){
a[i] = (float)i / 0.23 + 1;
}
// 1. copy a to constant memory
cudaError_t err = cudaMemcpyToSymbol(const_d_a, a, size);
if (err != cudaSuccess){
printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__);
exit(EXIT_FAILURE);
}
cudaError_t err2 = cudaMemcpyFromSymbol(const_d_a, b, size);
if (err2 != cudaSuccess){
printf("%s in %s at line %d\n", cudaGetErrorString(err2), __FILE__, __LINE__);
exit(EXIT_FAILURE);
}
double checksum0, checksum1;
for (i = 0; i < N; i++){
checksum0 += a[i];
checksum1 += b[i];
}
printf("Checksum for elements in host memory is %f\n.", checksum0);
printf("Checksum for elements in constant memory is %f\n.", checksum1);
return 0;
}
In CUDA, the various cudaMemcpy* operations are modeled after the C standard library memcpy routine. In that function, the first pointer is always the destination pointer and the second pointer is always the source pointer. That is true for all cudaMemcpy* functions as well.
Therefore, if you want to do cudaMemcpyToSymbol, the symbol had better be the first (destination) argument passed to the function (the second argument would be a host pointer). If you want to do cudaMemcpyFromSymbol, the symbol needs to be the second argument (the source position), and the host pointer is the first argument. That's not what you have here:
cudaError_t err2 = cudaMemcpyFromSymbol(const_d_a, b, size);
^ ^
| This should be the symbol.
|
This is supposed to be the host destination pointer.
You can discover this with a review of the API documentation.
If we reverse the order of those two arguments in that line of code:
cudaError_t err2 = cudaMemcpyFromSymbol(b, const_d_a, size);
Your code will run with no errors and the final results printed will match.
There is no need to use an ampersand with either of the a or b pointers in these functions. a and b are already pointers. In the example you linked, pi_gpu_h is not a pointer. It is an ordinary variable. To copy something to it using cudaMemcpyFromSymbol, it is necessary to take the address of that ordinary variable, because the function expects a (destination) pointer.
As an aside, this doesn't look right:
__constant__ float const_d_a[N * sizeof(float)];
This is effectively a static array declaration, and apart from the __constant__ decorator it should be done equivalently to how you would do it in C or C++. It's not necessary to multiply N by sizeof(float) here, if you want storage for N float quantities. Just N by itself will do that:
__constant__ float const_d_a[N];
however leaving that as-is does not create problems for the code you have posted.

Heisenbug in CUDA kernel, global memory access

About two years ago, I wrote a kernel for work on several numerical grids simultaneously. Some very strange behaviour emerged, which resulted in wrong results. When hunting down the bug utilizing printf()-statements inside the kernel, the bug vanished.
Due to deadline constraints, I kept it that way, though recently I figured that this was no appropriate coding style. So I revisited my kernel and boiled it down to what you see below.
__launch_bounds__(672, 2)
__global__ void heisenkernel(float *d_u, float *d_r, float *d_du, int radius,
int numNodesPerGrid, int numBlocksPerSM, int numGridsPerSM, int numGrids)
{
__syncthreads();
int id_sm = blockIdx.x / numBlocksPerSM; // (arbitrary) ID of Streaming Multiprocessor (SM) this thread works upon - (constant over lifetime of thread)
int id_blockOnSM = blockIdx.x % numBlocksPerSM; // Block number on this specific SM - (constant over lifetime of thread)
int id_r = id_blockOnSM * (blockDim.x - 2*radius) + threadIdx.x - radius; // Grid point number this thread is to work upon - (constant over lifetime of thread)
int id_grid = id_sm * numGridsPerSM; // Grid ID this thread is to work upon - (not constant over lifetime of thread)
while(id_grid < numGridsPerSM * (id_sm + 1)) // this loops over numGridsPerSM grids
{
__syncthreads();
int id_numInArray = id_grid * numNodesPerGrid + id_r; // Entry in array this thread is responsible for (read and possibly write) - (not constant over lifetime of thread)
float uchange = 0.0f;
//uchange = 1.0f; // if this line is uncommented, results will be computed correctly ("Solution 1")
float du = 0.0f;
if((threadIdx.x > radius-1) && (threadIdx.x < blockDim.x - radius) && (id_r < numNodesPerGrid) && (id_grid < numGrids))
{
if (id_r == 0) // FO-forward difference
du = (d_u[id_numInArray+1] - d_u[id_numInArray])/(d_r[id_numInArray+1] - d_r[id_numInArray]);
else if (id_r == numNodesPerGrid - 1) // FO-rearward difference
du = (d_u[id_numInArray] - d_u[id_numInArray-1])/(d_r[id_numInArray] - d_r[id_numInArray-1]);
else if (id_r == 1 || id_r == numNodesPerGrid - 2) //SO-central difference
du = (d_u[id_numInArray+1] - d_u[id_numInArray-1])/(d_r[id_numInArray+1] - d_r[id_numInArray-1]);
else if(id_r > 1 && id_r < numNodesPerGrid - 2)
du = d_fourpoint_constant * ((d_u[id_numInArray+1] - d_u[id_numInArray-1])/(d_r[id_numInArray+1] - d_r[id_numInArray-1])) + (1-d_fourpoint_constant) * ((d_u[id_numInArray+2] - d_u[id_numInArray-2])/(d_r[id_numInArray+2] - d_r[id_numInArray-2]));
else
du = 0;
}
__syncthreads();
if((threadIdx.x > radius-1 && threadIdx.x < blockDim.x - radius) && (id_r < numNodesPerGrid) && (id_grid < numGrids))
{
d_u[ id_numInArray] = d_u[id_numInArray] * uchange; // if this line is commented out, results will be computed correctly ("Solution 2")
d_du[ id_numInArray] = du;
}
__syncthreads();
++id_grid;
}
This kernel computes the derivative of some value at all grid points for a number of numerical 1D-grids.
Things to consider: (see full code base at the bottom)
a grid consists of 1300 grid points
each grid has to be worked upon by two blocks (due to memory/register limitations)
each block successively works on 37 grids (or better: grid halves, the while-loop takes care of that)
each thread is responsible for the same grid point in each grid
for the derivative to be computed, the threads need access to data from the four next grid points
in order to keep the blocks indepentend from each other, a small overlap on the grid is introduced (grid points 666, 667, 668, 669 of each grid are read from by two threads from different blocks, though only one thread is writing to them, it is this overlap where the problems occur)
due to the boiling down process, the two threads on each side of the blocks do no computations, in the original they are responsible for writing the corresponing grid values to shared memory
The values of the grids are stored in u_arr, du_arr and r_arr (and their corresponding device arrays d_u, d_du and d_r).
Each grid occupies 1300 consecutive values in each of these arrays.
The while-loop in the kernel iterates over 37 grids for each block.
To evaluate the workings of the kernel, each grid is initialized with the exact same values, so a deterministic program will produce the same result for each grid.
This does not happen with my code.
The weirdness of the Heisenbug:
I compared the computed values of grid 0 with each of the other grids, and there are differences at the overlap (grid points 666-669), though not consistently. Some grids have the right values, some do not. Two consecutive runs will mark different grids as erroneous.
The first thing that came to mind was that two threads at this overlap try to concurrently write to memory, though that does not seem to be the case (I checked.... and re-checked).
Commenting or un-commenting lines or using printf() for debugging purposes will alter
the outcome of the program as well: When "asking" the threads responsible for the grid points in question, they tell me that everything is allright, and they are actually correct. As soon as I force a thread to print out its variables, they will be computed (and more importantly: stored) correctly.
The same goes for debugging with Nsight Eclipse.
Memcheck / Racecheck:
cuda-memcheck (memcheck and racecheck) report no memory/racecondition problems, though even the usage of one of these tools have the ability to impact the correctness of the results.
Valgrind gives some warnings, though I think they have something to do with the CUDA API which I can not influence and which seem unrelated to my problem.
(Update)
As pointed out, cuda-memcheck --tool racecheck only works for shared memory race conditions, whereas the problem at hand has a race condition on d_u, i.e., global memory.
Testing environment:
The original kernel has been tested on different CUDA devices and with different compute capabilities (2.0, 3.0 and 3.5) with the bug showing up in every configuration (in some form or another).
My (main) testsystem is the following:
2 x GTX 460, tested on both the GPU that ran the X-server as well as
the other one
Driver Version: 340.46
Cuda Toolkit 6.5
Linux Kernel 3.11.0-12-generic (Linux Mint 16 - Xfce)
State of solution:
By now I am pretty sure that some memory access is the culprit, maybe some optimization from the compiler or use of uninitialized values, and that I obviously do not understand some fundamental CUDA paradigm.
The fact that printf() statements inside the kernel (which through some dark magic have to utilize device and host memory as well) and memcheck algorithms (cuda-memcheck and valgrind) influence
the bevavior point in the same direction.
I am sorry for this somewhat complicated kernel, but I boiled the original kernel and invocation down as much as I could, and this is as far as I got. By now I have learned to admire this problem, and I am looking forward to learning what is going on here.
Two "solutions", which force the kernel do work as intended, are marked in the code.
(Update) As mentioned in the correct answer below, the problem with my code is a race condition at the border of the thread-blocks. As there are two blocks working on each grid and there is no guarantee as to which block works first, resulting in the behavior outlined below. It also explains the correct results when employing "Solution 1" as mentioned in the code, because the input/output value d_u is not altered when uchange = 1.0.
The simple solution is to split this kernel into two kernels, one computing d_u, the other computing the derivative d_du. It would be more desirable to have just one kernel invocation instead of two, though I do not know how to accomplish this with -arch=sm_20. With -arch=sm_35 one could probably use dynamic parallelism to achieve that, though the overhead for the second kernel invocation is negligible.
heisenbug.cu:
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>
const float r_sol = 6.955E8f;
__constant__ float d_fourpoint_constant = 0.2f;
__launch_bounds__(672, 2)
__global__ void heisenkernel(float *d_u, float *d_r, float *d_du, int radius,
int numNodesPerGrid, int numBlocksPerSM, int numGridsPerSM, int numGrids)
{
__syncthreads();
int id_sm = blockIdx.x / numBlocksPerSM; // (arbitrary) ID of Streaming Multiprocessor (SM) this thread works upon - (constant over lifetime of thread)
int id_blockOnSM = blockIdx.x % numBlocksPerSM; // Block number on this specific SM - (constant over lifetime of thread)
int id_r = id_blockOnSM * (blockDim.x - 2*radius) + threadIdx.x - radius; // Grid point number this thread is to work upon - (constant over lifetime of thread)
int id_grid = id_sm * numGridsPerSM; // Grid ID this thread is to work upon - (not constant over lifetime of thread)
while(id_grid < numGridsPerSM * (id_sm + 1)) // this loops over numGridsPerSM grids
{
__syncthreads();
int id_numInArray = id_grid * numNodesPerGrid + id_r; // Entry in array this thread is responsible for (read and possibly write) - (not constant over lifetime of thread)
float uchange = 0.0f;
//uchange = 1.0f; // if this line is uncommented, results will be computed correctly ("Solution 1")
float du = 0.0f;
if((threadIdx.x > radius-1) && (threadIdx.x < blockDim.x - radius) && (id_r < numNodesPerGrid) && (id_grid < numGrids))
{
if (id_r == 0) // FO-forward difference
du = (d_u[id_numInArray+1] - d_u[id_numInArray])/(d_r[id_numInArray+1] - d_r[id_numInArray]);
else if (id_r == numNodesPerGrid - 1) // FO-rearward difference
du = (d_u[id_numInArray] - d_u[id_numInArray-1])/(d_r[id_numInArray] - d_r[id_numInArray-1]);
else if (id_r == 1 || id_r == numNodesPerGrid - 2) //SO-central difference
du = (d_u[id_numInArray+1] - d_u[id_numInArray-1])/(d_r[id_numInArray+1] - d_r[id_numInArray-1]);
else if(id_r > 1 && id_r < numNodesPerGrid - 2)
du = d_fourpoint_constant * ((d_u[id_numInArray+1] - d_u[id_numInArray-1])/(d_r[id_numInArray+1] - d_r[id_numInArray-1])) + (1-d_fourpoint_constant) * ((d_u[id_numInArray+2] - d_u[id_numInArray-2])/(d_r[id_numInArray+2] - d_r[id_numInArray-2]));
else
du = 0;
}
__syncthreads();
if((threadIdx.x > radius-1 && threadIdx.x < blockDim.x - radius) && (id_r < numNodesPerGrid) && (id_grid < numGrids))
{
d_u[ id_numInArray] = d_u[id_numInArray] * uchange; // if this line is commented out, results will be computed correctly ("Solution 2")
d_du[ id_numInArray] = du;
}
__syncthreads();
++id_grid;
}
}
bool gridValuesEqual(float *matarray, uint id0, uint id1, const char *label, int numNodesPerGrid){
bool retval = true;
for(uint i=0; i<numNodesPerGrid; ++i)
if(matarray[id0 * numNodesPerGrid + i] != matarray[id1 * numNodesPerGrid + i])
{
printf("value %s at position %u of grid %u not equal that of grid %u: %E != %E, diff: %E\n",
label, i, id0, id1, matarray[id0 * numNodesPerGrid + i], matarray[id1 * numNodesPerGrid + i],
matarray[id0 * numNodesPerGrid + i] - matarray[id1 * numNodesPerGrid + i]);
retval = false;
}
return retval;
}
int main(int argc, const char* argv[])
{
float *d_u;
float *d_du;
float *d_r;
float *u_arr;
float *du_arr;
float *r_arr;
int numNodesPerGrid = 1300;
int numBlocksPerSM = 2;
int numGridsPerSM = 37;
int numSM = 7;
int TPB = 672;
int radius = 2;
int numGrids = 259;
int memsize_grid = sizeof(float) * numNodesPerGrid;
int numBlocksPerGrid = numNodesPerGrid / (TPB - 2 * radius) + (numNodesPerGrid%(TPB - 2 * radius) == 0 ? 0 : 1);
printf("---------------------------------------------------------------------------\n");
printf("--- Heisenbug Extermination Tracker ---------------------------------------\n");
printf("---------------------------------------------------------------------------\n\n");
cudaSetDevice(0);
cudaDeviceReset();
cudaMalloc((void **) &d_u, memsize_grid * numGrids);
cudaMalloc((void **) &d_du, memsize_grid * numGrids);
cudaMalloc((void **) &d_r, memsize_grid * numGrids);
u_arr = new float[numGrids * numNodesPerGrid];
du_arr = new float[numGrids * numNodesPerGrid];
r_arr = new float[numGrids * numNodesPerGrid];
for(uint k=0; k<numGrids; ++k)
for(uint i=0; i<numNodesPerGrid; ++i)
{
uint index = k * numNodesPerGrid + i;
if (i < 585)
r_arr[index] = i * (6000.0f);
else
{
if (i == 585)
r_arr[index] = r_arr[index - 1] + 8.576E-6f * r_sol;
else
r_arr[index] = r_arr[index - 1] + 1.02102f * ( r_arr[index - 1] - r_arr[index - 2] );
}
u_arr[index] = 1E-10f * (i+1);
du_arr[index] = 0.0f;
}
/*
printf("\n\nbefore kernel start\n\n");
for(uint k=0; k<numGrids; ++k)
printf("matrix->du_arr[k*paramH.numNodes + 668]:\t%E\n", du_arr[k*numNodesPerGrid + 668]);//*/
bool equal = true;
for(int k=1; k<numGrids; ++k)
{
equal &= gridValuesEqual(u_arr, 0, k, "u", numNodesPerGrid);
equal &= gridValuesEqual(du_arr, 0, k, "du", numNodesPerGrid);
equal &= gridValuesEqual(r_arr, 0, k, "r", numNodesPerGrid);
}
if(!equal)
printf("Input values are not identical for different grids!\n\n");
else
printf("All grids contain the same values at same grid points.!\n\n");
cudaMemcpy(d_u, u_arr, memsize_grid * numGrids, cudaMemcpyHostToDevice);
cudaMemcpy(d_du, du_arr, memsize_grid * numGrids, cudaMemcpyHostToDevice);
cudaMemcpy(d_r, r_arr, memsize_grid * numGrids, cudaMemcpyHostToDevice);
printf("Configuration:\n\n");
printf("numNodesPerGrid:\t%i\nnumBlocksPerSM:\t\t%i\nnumGridsPerSM:\t\t%i\n", numNodesPerGrid, numBlocksPerSM, numGridsPerSM);
printf("numSM:\t\t\t\t%i\nTPB:\t\t\t\t%i\nradius:\t\t\t\t%i\nnumGrids:\t\t\t%i\nmemsize_grid:\t\t%i\n", numSM, TPB, radius, numGrids, memsize_grid);
printf("numBlocksPerGrid:\t%i\n\n", numBlocksPerGrid);
printf("Kernel launch parameters:\n\n");
printf("moduleA2_3<<<%i, %i, %i>>>(...)\n\n", numBlocksPerSM * numSM, TPB, 0);
printf("Launching Kernel...\n\n");
heisenkernel<<<numBlocksPerSM * numSM, TPB, 0>>>(d_u, d_r, d_du, radius, numNodesPerGrid, numBlocksPerSM, numGridsPerSM, numGrids);
cudaDeviceSynchronize();
cudaMemcpy(u_arr, d_u, memsize_grid * numGrids, cudaMemcpyDeviceToHost);
cudaMemcpy(du_arr, d_du, memsize_grid * numGrids, cudaMemcpyDeviceToHost);
cudaMemcpy(r_arr, d_r, memsize_grid * numGrids, cudaMemcpyDeviceToHost);
/*
printf("\n\nafter kernel finished\n\n");
for(uint k=0; k<numGrids; ++k)
printf("matrix->du_arr[k*paramH.numNodes + 668]:\t%E\n", du_arr[k*numNodesPerGrid + 668]);//*/
equal = true;
for(int k=1; k<numGrids; ++k)
{
equal &= gridValuesEqual(u_arr, 0, k, "u", numNodesPerGrid);
equal &= gridValuesEqual(du_arr, 0, k, "du", numNodesPerGrid);
equal &= gridValuesEqual(r_arr, 0, k, "r", numNodesPerGrid);
}
if(!equal)
printf("Results are wrong!!\n");
else
printf("All went well!\n");
cudaFree(d_u);
cudaFree(d_du);
cudaFree(d_r);
delete [] u_arr;
delete [] du_arr;
delete [] r_arr;
return 0;
}
Makefile:
CUDA = 1
DEFINES =
ifeq ($(CUDA), 1)
DEFINES += -DCUDA
CUDAPATH = /usr/local/cuda-6.5
CUDAINCPATH = -I$(CUDAPATH)/include
CUDAARCH = -arch=sm_20
endif
CXX = g++
CXXFLAGS = -pipe -g -std=c++0x -fPIE -O0 $(DEFINES)
VALGRIND = valgrind
VALGRIND_FLAGS = -v --leak-check=yes --log-file=out.memcheck
CUDAMEMCHECK = cuda-memcheck
CUDAMC_FLAGS = --tool memcheck
RACECHECK = $(CUDAMEMCHECK)
RACECHECK_FLAGS = --tool racecheck
INCPATH = -I. $(CUDAINCPATH)
LINK = g++
LFLAGS = -O0
LIBS =
ifeq ($(CUDA), 1)
NVCC = $(CUDAPATH)/bin/nvcc
LIBS += -L$(CUDAPATH)/lib64/
LIBS += -lcuda -lcudart -lcudadevrt
NVCCFLAGS = -g -G -O0 --ptxas-options=-v
NVCCFLAGS += -lcuda -lcudart -lcudadevrt -lineinfo --machine 64 -x cu $(CUDAARCH) $(DEFINES)
endif
all:
$(NVCC) $(NVCCFLAGS) $(INCPATH) -c -o $(DST_DIR)heisenbug.o $(SRC_DIR)heisenbug.cu
$(LINK) $(LFLAGS) -o heisenbug heisenbug.o $(LIBS)
clean:
rm heisenbug.o
rm heisenbug
memrace: all
./heisenbug > out
$(VALGRIND) $(VALGRIND_FLAGS) ./heisenbug > out.memcheck.log
$(CUDAMEMCHECK) $(CUDAMC_FLAGS) ./heisenbug > out.cudamemcheck
$(RACECHECK) $(RACECHECK_FLAGS) ./heisenbug > out.racecheck
Note that in the entirety of your writeup, I do not see a question being explicitly asked, therefore I am responding to:
I am looking forward to learning what is going on here.
You have a race condition on d_u.
by your own statement:
•in order to keep the blocks indepentend from each other, a small overlap on the grid is introduced (grid points 666, 667, 668, 669 of each grid are read from by two threads from different blocks, though only one thread is writing to them, it is this overlap where the problems occur)
Furthermore, if you comment out the write to d_u, according to your statement in the code, the problem disappears.
CUDA threadblocks can execute in any order. You have at least 2 different blocks that are reading from grid points 666, 667, 668, 669. The results will be different depending on which case actually occurs:
both blocks read the value before any writes occur.
one block reads the value, then a write occurs, then the other block reads the value.
The blocks are not independent of each other (contrary to your statement) if one block is reading a value that can be written to by another block. The order of block execution will determine the result in this case, and CUDA does not specify the order of block execution.
Note that cuda-memcheck with the -tool racecheck option only captures race conditions related to __shared__ memory usage. Your kernel as posted uses no __shared__ memory, therefore I would not expect cuda-memcheck to report anything.
cuda-memcheck, in order to gather its data, does influence the order of block execution, so it's not surprising that it affects the behavior.
in-kernel printf represents a costly function call, writing to a global memory buffer. So it also affects execution behavior/patterns. And if you are printing out a large amount of data, exceeding the buffer lines of output, the effect is extremely costly (in terms of execution time) in the event of buffer overflow.
As an aside, Linux Mint is not a supported distro for CUDA, as far as I can see. However I don't think this is relevant to your issue; I can reproduce the behavior on a supported config.

Dot Product in CUDA using atomic operations - getting wrong results

I am trying to implement the dot product in CUDA and compare the result with what MATLAB returns. My CUDA code (based on this tutorial) is the following:
#include <stdio.h>
#define N (2048 * 8)
#define THREADS_PER_BLOCK 512
#define num_t float
// The kernel - DOT PRODUCT
__global__ void dot(num_t *a, num_t *b, num_t *c)
{
__shared__ num_t temp[THREADS_PER_BLOCK];
int index = threadIdx.x + blockIdx.x * blockDim.x;
temp[threadIdx.x] = a[index] * b[index];
__syncthreads(); //Synchronize!
*c = 0.00;
// Does it need to be tid==0 that
// undertakes this task?
if (0 == threadIdx.x) {
num_t sum = 0.00;
int i;
for (i=0; i<THREADS_PER_BLOCK; i++)
sum += temp[i];
atomicAdd(c, sum);
//WRONG: *c += sum; This read-write operation must be atomic!
}
}
// Initialize the vectors:
void init_vector(num_t *x)
{
int i;
for (i=0 ; i<N ; i++){
x[i] = 0.001 * i;
}
}
// MAIN
int main(void)
{
num_t *a, *b, *c;
num_t *dev_a, *dev_b, *dev_c;
size_t size = N * sizeof(num_t);
cudaMalloc((void**)&dev_a, size);
cudaMalloc((void**)&dev_b, size);
cudaMalloc((void**)&dev_c, size);
a = (num_t*)malloc(size);
b = (num_t*)malloc(size);
c = (num_t*)malloc(size);
init_vector(a);
init_vector(b);
cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);
dot<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c);
cudaMemcpy(c, dev_c, sizeof(num_t), cudaMemcpyDeviceToHost);
printf("a = [\n");
int i;
for (i=0;i<10;i++){
printf("%g\n",a[i]);
}
printf("...\n");
for (i=N-10;i<N;i++){
printf("%g\n",a[i]);
}
printf("]\n\n");
printf("a*b = %g.\n", *c);
free(a); free(b); free(c);
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
}
and I compile it with:
/usr/local/cuda-5.0/bin/nvcc -m64 -I/usr/local/cuda-5.0/include -gencode arch=compute_20,code=sm_20 -o multi_dot_product.o -c multi_dot_product.cu
g++ -m64 -o multi_dot_product multi_dot_product.o -L/usr/local/cuda-5.0/lib64 -lcudart
Information about my NVIDIA cards can be found at http://pastebin.com/8yTzXUuK. I tried to verify the result in MATLAB using the following simple code:
N = 2048 * 8;
a = zeros(N,1);
for i=1:N
a(i) = 0.001*(i-1);
end
dot_product = a'*a;
But as N increases, I'm getting significantly different results (For instance, for N=2048*32 CUDA reutrns 6.73066e+07 while MATLAB returns 9.3823e+07. For N=2048*64 CUDA gives 3.28033e+08 while MATLAB gives 7.5059e+08). I incline to believe that the discrepancy stems from the use of float in my C code, but if I replace it with double the compiler complains that atomicAdd does not support double parameters. How should I fix this problem?
Update: Also, for high values of N (e.g. 2048*64), I noticed that the result returned by CUDA changes at every run. This does not happen if N is low (e.g. 2048*8).
At the same time I have a more fundamental question: The variable temp is an array of size THREADS_PER_BLOCK and is shared between threads in the same block. Is it also shared between blocks or every block operates on a different copy of this variable? Should I think of the method dot as instructions to every block? Can someone elaborate on how exactly the jobs are split and how the variables are shared in this example
Comment this line out of your kernel:
// *c = 0.00;
And add these lines to your host code, before the kernel call (after the cudaMalloc of dev_c):
num_t h_c = 0.0f;
cudaMemcpy(dev_c, &h_c, sizeof(num_t), cudaMemcpyHostToDevice);
And I believe you'll get results that match matlab, more or less.
The fact that you have this line in your kernel unprotected by any synchronization is messing you up. Every thread of every block, whenever they happen to execute, is zeroing out c as you have written it.
By the way, we can do significantly better with this operation in general by using a classical parallel reduction method. A basic (not optimized) illustration is here. If you combine that method with your usage of shared memory and a single atomicAdd at the end (one atomicAdd per block) you'll have a significantly improved implementation. Although it's not a dot product, this example combines those ideas.
Edit: responding to a question below in the comments:
A kernel function is the set of instructions that all threads in the grid (all threads associated with a kernel launch, by definition) execute. However, it's reasonable to think of execution as being managed by threadblock, since the threads in a threadblock are executing together to a large extent. However, even within a threadblock, execution is not in perfect lockstep across all threads, necessarily. Normally when we think of lockstep execution, we think of a warp which is a group of 32 threads in a single threadblock. Therefore, since execution amongst warps within a block can be skewed, this hazard was present even for a single threadblock. However, if there were only one threadblock, we could have gotten rid of the hazard in your code using appropriate sync and control mechanisms like __syncthreads() and (if threadIdx.x == 0) etc. But these mechanisms are useless for the general case of controlling execution across multiple threadsblocks. Multiple threadblocks can execute in any order. The only defined sync mechanism across an entire grid is the kernel launch itself. Therefore to fix your issue, we had to zero out c prior to the kernel launch.

cuda gdb: the kernel indicated is not in the code

My original problem, is that I have functions with a long list of arguments, that exceeded the memory that is allowed to be passed as an argument to a cuda kernel (I don't remember how many bytes, because it's been a while since I dealt with that). So, the way I bypassed this problem, was to define a new structure that its members are pointers pointing to other structures that I can dereference from within the kernel later.
... this is where the current problem begins: at the point where I'm trying to dereference the pointers (members of the structure I created earlier) from within the kernel, I get CUDA_EXCEPTION_5, Warp Out-of-range Address
...from the cuda-gdb. And of top of that, the kernel name and arguments (which are reported 'not live at this point' which cuda-gdb gives as the one with the error, is not one that I created in my code.
Now, for the more specifics :
here are the structures involved:
typedef struct {
int strx;
int stry;
int strz;
float* el;
} manmat;
typedef struct {
manmat *x;
manmat *y;
manmat *z;
} manmatvec;
here's how i'm trying to group the kernel's arguments inside the main:
int main () {
...
...
manmat resu0;
resu0.strx = n+2; resu0.stry = m+2; resu0.strz = l+2;
if (cudaMalloc((void**)&resu0.el,sizeof(float) * (n+2)*(m+2)*(l+2)) != cudaSuccess) cout << endl << " ERROR allocating memory for manmat resu0" << endl ;
manmat resv0;
resv0.strx = n+2; resv0.stry = m+2; resv0.strz = l+2;
if (cudaMalloc((void**)&resv0.el,sizeof(float) * (n+2)*(m+2)*(l+2)) != cudaSuccess) cout << endl << " ERROR allocating memory for manmat resv0" << endl ;
manmat resw0;
resw0.strx = n+2; resw0.stry = m+2; resw0.strz = l+2;
if (cudaMalloc((void**)&resw0.el,sizeof(float) * (n+2)*(m+2)*(l+2)) != cudaSuccess) cout << endl << " ERROR allocating memory for manmat resw0" << endl ;
manmatvec residues0 ;
residues0.x = &resu0;
residues0.y = &resv0;
residues0.z = &resw0;
exec_res_std_2d <<<numBlocks2D, threadsPerBlock2D>>> (residues0, ......) ;
.....
}
... and this is what happens in the kernel :
__global__ void exec_res_std_2d (manmatvec residues, ......) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
int k = blockIdx.y * blockDim.y + threadIdx.y;
manmat *resup;
manmat *resvp;
manmat *reswp;
resup = residues.x;
resvp = residues.y;
reswp = residues.z;
manmat resu, resv, resw ;
resu.strx = (*resup).strx; //LINE 1626
resu.stry = (*resup).stry;
resu.strz = (*resup).strz;
resu.el = (*resup).el;
resv = *resvp;
resw = *reswp;
.....
}
and finally, this is what cuda-gdb gives as output :
..................
[Launch of CUDA Kernel 1065 (exec_res_std_2d<<<(1,2,1),(32,16,1)>>>) on Device 0]
[Launch of CUDA Kernel 1066 (exec_res_bot_2d<<<(1,2,1),(32,16,1)>>>) on Device 0]
Program received signal CUDA_EXCEPTION_5, Warp Out-of-range Address.
[Switching focus to CUDA kernel 1065, grid 1066, block (0,0,0), thread (0,2,0), device 0, sm 0, warp 2, lane 0]
0x0000000003179020 in fdivide<<<(1,2,1),(32,16,1)>>> (a=warning: Variable is not live at this point. Value is undetermined.
..., pt=warning: Variable is not live at this point. Value is undetermined.
..., cells=warning: Variable is not live at this point. Value is undetermined.
...) at ola.cu:1626
1626 ola.cu: No such file or directory.
in ola.cu
I have to note that I haven't defined ANY function , __device__ or __global__ in my code called fdivide.....
Also, it might be important to say that, in the beginning of the run of the program inside the debugger, despite the fact that I compile my cuda c files with -arch=sm_20 -g -G -gencode arch=compute_20,code=sm_20, I get,
[New Thread 0x7ffff3b69700 (LWP 12465)]
[Context Create of context 0x1292340 on Device 0]
warning: no loadable sections found in added symbol-file /tmp/cuda-dbg/12456/session1/elf.1292340.1619c10.o.LkkWns
warning: no loadable sections found in added symbol-file /tmp/cuda-dbg/12456/session1/elf.1292340.1940ad0.o.aHtC7W
warning: no loadable sections found in added symbol-file /tmp/cuda-dbg/12456/session1/elf.1292340.2745680.o.bVXEWl
warning: no loadable sections found in added symbol-file /tmp/cuda-dbg/12456/session1/elf.1292340.2c438b0.o.cgUqiP
warning: no loadable sections found in added symbol-file /tmp/cuda-dbg/12456/session1/elf.1292340.2c43980.o.4diaQ4
warning: no loadable sections found in added symbol-file /tmp/cuda-dbg/12456/session1/elf.1292340.2dc9380.o.YYJAr5
Any answers or hints, or suggestions that could help me with this issue are very welcome!
please note that i've only recently started programming with cuda-c, and I'm not very experienced with cuda-gdb. Most of the debugging I did in C code I did it 'manually' by checking the output at various points of the code....
Also, this code is running on tesla M2090, and is also compiled to run on 2.0 architecture.
This will be a problem:
manmatvec residues0 ;
residues0.x = &resu0;
residues0.y = &resv0;
residues0.z = &resw0;
The resu0, resv0, and resw0 variables are allocated in host memory - on the host stack. You're putting host addresses into the manmatvec structure, then passing the manmatvec into the kernel. On the receiving end, the CUDA code cannot access the host memory addresses provided in the structure.
If you're going to pass the addresses of the resu0, resv0, resw0 variables, you need to allocate them from device memory.
I don't know if this is the entire problem, but I'm pretty sure it's a top contributor.

CUDA-GDB crashes in Kernel

I've been trying to debug my code, as I know something is going wrong in the Kernel, and I've been trying to figure out what specifically. If I try to step into the kernel it seems to completely step over the kernel functions, and will eventually cause an error on quitting:
Single stepping until exit from function dyld_stub_cudaSetupArgument,
which has no line number information.
[Launch of CUDA Kernel 0 (incrementArrayOnDevice<<<(3,1,1),(4,1,1)>>>) on
Device 0]
[Termination of CUDA Kernel 0 (incrementArrayOnDevice<<<(3,1,1),
(4,1,1)>>>) on Device 0]
[Launch of CUDA Kernel 1 (fillinBoth<<<(40,1,1),(1,1,1)>>>) on Device 0]
[Termination of CUDA Kernel 1 (fillinBoth<<<(40,1,1),(1,1,1)>>>) on Device 0]
add (below=0x124400, newtip=0x124430, newfork=0x125ac0) at test.cu:1223
And if I try to break in the Kernel my entire computer crashes and I have to restart it.
I figure there must be something wrong with the way I'm calling the kernel, but I can't figure out what.
The code is rather long, so I'm only including an excerpt of it:
__global__ void fillinOne(seqptr qset, long max) {
int i, j;
aas aa;
int idx = blockIdx.x;
__shared__ long qs[3];
if(idx < max)
{
memcpy(qs, qset[idx], sizeof(long[3]));
for (i = 0; i <= 1; i++)
{
for (aa = ala; (long)aa <= (long)stop; aa = (aas)((long)aa + 1))
{
if (((1L << ((long)aa)) & qs[i]) != 0)
{
for (j = i + 1; j <= 2; j++)
qs[j] |= cudaTranslate[(long)aa - (long)ala][j - i];
}
}
}
}
}
//Kernel for left!= NULL and rt != NULL
void fillin(node *p, node *left, node *rt)
{
cudaError_t err = cudaGetLastError();
size_t stepsize = chars * sizeof(long);
size_t sitesize = chars * sizeof(sitearray);
//int i, j;
if (left == NULL)
{
//copy rt->numsteps into p->numsteps--doesn't actually require CUDA, because no computation to do
memcpy(p->numsteps, rt->numsteps, stepsize);
checkCUDAError("memcpy");
//allocate siteset (array of sitearrays) on device
seqptr qsites; //as in array of qs's
cudaMalloc((void **) &qsites, sitesize);
checkCUDAError("malloc");
//copy rt->siteset into device array (equivalent to memcpy(qs, rs) but for whole array)
cudaMemcpy(qsites, rt->siteset, sitesize, cudaMemcpyHostToDevice);
checkCUDAError("memcpy");
//do loop in device
int block_size = 1; //each site operated on independently
int n_blocks = chars;
fillinOne <<< n_blocks, block_size>>> (qsites, chars);
cudaThreadSynchronize();
//put qset in p->siteset--equivalent to memcpy(p->siteset[m], qs)
cudaMemcpy(p->siteset, qsites, sitesize, cudaMemcpyDeviceToHost);
checkCUDAError("memcpy");
//Cleanup
cudaFree(qsites);
}
If anyone has any ideas at all, please resond! Thanks in advance!
I suppose you have a single card configuration. When you are debugging a cuda kernel and you break inside it you effectively put the display driver in pause. That causes what you think is a crash. If you want to use the cuda-gdb with only one graphics card you must use it in command line mode (don't start X or press ctrl-alt-fn from X).
If you have two cards you must run the code in the card not running the display. Use cudaSelectDevice(n).