CUDA unknown error

I'm trying to run mainSift.cpp from CudaSift on an Nvidia Tesla M2090. First of all, as explained in this question, I had to change sm_35 to sm_20 in the CMakeLists.txt.
Unfortunately, this error is now returned:
checkMsg() CUDA error: LaplaceMulti() execution failed
in file </ghome/rzhengac/Downloads/CudaSift/cudaSiftH.cu>, line 318 : unknown error.
And this is the LaplaceMulti code:
double LaplaceMulti(cudaTextureObject_t texObj, CudaImage *results, float baseBlur, float diffScale, float initBlur)
{
  float kernel[12*16];
  float scale = baseBlur;
  for (int i=0; i<NUM_SCALES+3; i++) {
    float kernelSum = 0.0f;
    float var = scale*scale - initBlur*initBlur;
    for (int j=-LAPLACE_R; j<=LAPLACE_R; j++) {
      kernel[16*i+j+LAPLACE_R] = (float)expf(-(double)j*j/2.0/var);
      kernelSum += kernel[16*i+j+LAPLACE_R];
    }
    for (int j=-LAPLACE_R; j<=LAPLACE_R; j++)
      kernel[16*i+j+LAPLACE_R] /= kernelSum;
    scale *= diffScale;
  }
  safeCall(cudaMemcpyToSymbol(d_Kernel2, kernel, 12*16*sizeof(float)));
  int width = results[0].width;
  int pitch = results[0].pitch;
  int height = results[0].height;
  dim3 blocks(iDivUp(width+2*LAPLACE_R, LAPLACE_W), height);
  dim3 threads(LAPLACE_W+2*LAPLACE_R, LAPLACE_S);
  LaplaceMulti<<<blocks, threads>>>(texObj, results[0].d_data, width, pitch, height);
  checkMsg("LaplaceMulti() execution failed\n");
  return 0.0;
}
I've already read this question, which seems somewhat similar, but I don't understand what the solution means or how to apply it to my problem.
Why does the error occur?

The error occurs because you are running code that uses a feature (texture objects) which is not supported on your GPU. I am a little surprised that the compiler doesn't generate an error during compilation, but that is another question.
There is no solution except to use supported hardware or to rewrite the code.
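Texture objects require a Kepler-class GPU (compute capability 3.0 or newer), whereas the Tesla M2090 is compute capability 2.0. As a minimal sketch, not part of CudaSift, you could guard against this at runtime before taking any texture-object code path:
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);   // device 0; adjust if you select another device
  // Texture objects (cudaTextureObject_t) need compute capability 3.0 or newer.
  if (prop.major < 3) {
    fprintf(stderr, "%s is compute capability %d.%d; texture objects need >= 3.0\n",
            prop.name, prop.major, prop.minor);
    return 1;
  }
  // ... safe to use the texture-object code path from here ...
  return 0;
}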
[This answer assembled from comments and added as a community wiki entry to get this answer off the unanswered list for the CUDA tag]

Related

CUDA/C - Using malloc in kernel functions gives strange results

I'm new to CUDA/C and new to stack overflow. This is my first question.
I'm trying to allocate memory dynamically in a kernel function, but the results are unexpected.
I've read that using malloc() in a kernel can lower performance a lot, but I need it anyway, so I first tried with a simple int ** array just to test the possibility; later I'll need to allocate more complex structs.
In main I used cudaMalloc() to allocate space for the array of int *, and then in the kernel I used malloc() in every thread to allocate the array for each index of the outer array. I then used another thread to check the result, but it doesn't always work.
Here's the main code:
#define N_CELLE 1024*2
#define L_CELLE 512

extern "C" {
int main(int argc, char **argv) {
  int *result = (int *)malloc(sizeof(int));
  int *d_result;
  int size_numbers = N_CELLE * sizeof(int *);
  int **d_numbers;

  cudaMalloc((void **)&d_numbers, size_numbers);
  cudaMalloc((void **)&d_result, sizeof(int *));

  kernel_one<<<2, 1024>>>(d_numbers);
  cudaDeviceSynchronize();
  kernel_two<<<1, 1>>>(d_numbers, d_result);

  cudaMemcpy(result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
  printf("%d\n", *result);

  cudaFree(d_numbers);
  cudaFree(d_result);
  free(result);
}
}
I used extern "C" because I couldn't compile while importing my header, which is not used in this example code. I've kept it since I don't know whether it is relevant or not.
This is kernel_one code:
__global__ void kernel_one(int **d_numbers) {
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  d_numbers[i] = (int *)malloc(L_CELLE*sizeof(int));
  for(int j=0; j<L_CELLE; j++)
    d_numbers[i][j] = 1;
}
And this is kernel_two code:
__global__ void kernel_two(int **d_numbers, int *d_result) {
  int temp = 0;
  for(int i=0; i<N_CELLE; i++) {
    for(int j=0; j<L_CELLE; j++)
      temp += d_numbers[i][j];
  }
  *d_result = temp;
}
Everything works fine (i.e. the count is correct) as long as I allocate no more than 1024*2*512 ints in total in device memory. For example, if I #define N_CELLE 1024*4 the program starts giving "random" results, such as negative numbers.
Any idea what the problem could be?
Thanks!
In-kernel memory allocation draws memory from a statically allocated runtime heap. At larger sizes, you are exceeding the size of that heap and then your two kernels are attempting to read and write from uninitialised memory. This produces a runtime error on the device and renders the results invalid. You would already know this if you either added correct API error checking on the host side, or ran your code with the cuda-memcheck utility.
The solution is to ensure that the heap size is set to something appropriate before trying to run a kernel. Adding something like this:
size_t heapsize = sizeof(int) * size_t(N_CELLE) * size_t(2*L_CELLE);
cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapsize);
to your host code before any other API calls should solve the problem.
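In addition, in-kernel malloc() returns NULL once the heap is exhausted, so it is worth guarding the allocation; here is a minimal sketch of kernel_one with such a check added (the early return is an assumption about how you want to handle failure, not part of the original code):
__global__ void kernel_one(int **d_numbers) {
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  int *p = (int *)malloc(L_CELLE * sizeof(int));
  if (p == NULL)            // heap exhausted: bail out instead of writing through a bad pointer
    return;
  for (int j = 0; j < L_CELLE; j++)
    p[j] = 1;
  d_numbers[i] = p;
}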
I don't know anything about CUDA but these are severe bugs:
You cannot convert from int** to void**. They are not compatible types. Casting doesn't solve the problem, but hides it.
&d_numbers gives the address of a pointer to pointer which is wrong. It is of type int***.
Both of the above bugs result in undefined behavior. If your program somehow seems to work in some conditions, that's just pure (bad) luck.

(How To ?) CUDA Linear Interpolation of Short (16-bit) Texture Object [duplicate]

This question already has an answer here: CUDA Texture Linear Filtering.
I am not a CUDA beginner; I've written a handful of CUDA methods for processing radar data (2D). I've read through most of the CUDA Programming Guide, CUDA By Example, and lots of posts here (thank you, Stack Overflow contributors).
I mostly use pitched-linear memory. I've recently gotten into textures, and enjoyed the speed-up.
My question is: how do I get cudaFilterModeLinear to work with a texture object based on a signed 16-bit short?
Minimal reproducible code:
#include <helper_cuda.h> // checkCudaErrors
int main(int argc, char * argv[]){
  const unsigned int Nr = 4096;
  const unsigned int Na = 1024;

  void * ptr;
  size_t pitch;
  const size_t width = Nr*sizeof( short );
  const size_t height = Na;
  checkCudaErrors( cudaMallocPitch( &ptr, &pitch, width, height) );

  struct cudaResourceDesc resDesc;
  struct cudaTextureDesc texDesc;
  cudaTextureObject_t FrameTex;

  memset( &resDesc, 0, sizeof(resDesc) );
  resDesc.resType = cudaResourceTypePitch2D;
  //resDesc.res.pitch2D.desc = cudaCreateChannelDesc<float>();
  resDesc.res.pitch2D.desc = cudaCreateChannelDesc<short>();
  resDesc.res.pitch2D.devPtr = ptr;
  resDesc.res.pitch2D.pitchInBytes = pitch;
  resDesc.res.pitch2D.width = width;
  resDesc.res.pitch2D.height = height;

  // Specify texture object parameters
  memset( &texDesc, 0, sizeof(texDesc) );
  texDesc.addressMode[0] = cudaAddressModeClamp;
  texDesc.addressMode[1] = cudaAddressModeClamp;
  // filter modes: Point, Linear
  texDesc.filterMode = cudaFilterModeLinear;
  // read modes: NormalizedFloat, ElementType
  texDesc.readMode = cudaReadModeElementType;
  texDesc.normalizedCoords = 0;

  // Create texture object
  checkCudaErrors( cudaCreateTextureObject( &FrameTex, &resDesc, &texDesc, NULL ));

  cudaDeviceReset();
  return 0;
}
This will throw
CUDA error at ... code=26(cudaErrorInvalidFilterSetting) "cudaCreateTextureObject( &FrameTex, &resDesc, &texDesc, NULL )"
The bottom of page 42 of CUDA Programming Guide v8.0 says "Linear texture filtering may be done only for textures that are configured to return floating-point data."
I have no problem if the return value is floating point, but how do I base the texture on a 16-bit short?
This post demonstrates cudaFilterModeLinear with uchar, so surely it must be possible. The difference is that that code uses texture references, whereas I want texture objects.
This post has the answer.
It was not immediately obvious when searching through the posts, since the title does not indicate that the issue is related to integral texel types.
Specifically, if the texel is an integral type, then cudaReadModeNormalizedFloat must be used in conjunction with cudaFilterModeLinear. That is the implied meaning of "... configured to return floating-point data" from the Programming Guide. Geez, why can't Nvidia explicitly state that?
AFAIK there is no dependency on the addressMode or normalizedCoords enumerations.
Edit: cudaReadModeNormalizedFloat does not work with int32 texels. More generally, there is no hardware interpolation for int32.
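Applied to the question's code, only the read mode needs to change; the fetch kernel and the rescaling by 32767 below are illustrative assumptions about how the values would then be consumed, not part of the original code:
// Host side: keep the 16-bit channel descriptor, but return normalized floats.
resDesc.res.pitch2D.desc = cudaCreateChannelDesc<short>();
texDesc.filterMode = cudaFilterModeLinear;          // linear filtering is now legal
texDesc.readMode   = cudaReadModeNormalizedFloat;   // integral texels are promoted to float in [-1, 1]

// Device side: fetch as float and, if needed, rescale back to the short's range.
__global__ void sampleKernel(cudaTextureObject_t tex, float *out, int Nr, int Na)
{
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x < Nr && y < Na)
    out[y * Nr + x] = tex2D<float>(tex, x + 0.5f, y + 0.5f) * 32767.0f;
}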

Allocating device memory for a __global__ function in CUDA

I want to do this program in CUDA.
1. In "main.cpp":
struct Center{
  double * Data;
  int dimension;
};
typedef struct Center Center;

//I allocate a pointer to an array of Center elements with cudaMalloc, as follows
....
#include "kernel.cu"
....
Center *V_dev;
int M = 100, N = 4;
cudaStatus = cudaMalloc((void**)&V_dev, M*sizeof(Center));
Init<<<1,M>>>(V_dev, M, N); //I always know the value of N before calling
My "kernel.cu" file is something like this
#include "cuda_runtime.h"
#include"device_launch_parameters.h"
... //other include headers to allow my .cu file to know the Center type definition
__global__ void Init(Center *V, int N, int dimension){
V[threadIdx.x].dimension = dimension;
V[threadIdx.x].Data = (double*)malloc(dimension*sizeof(double));
for(int i=0; i<dimension; i++)
V[threadIdx.x].Data[i] = 0; //For the value, it can be any kind of operation returning a float that i want to be able put here
}
I'm using Visual Studio 2008 and CUDA 5.0. When I build my project, I get this error:
error: calling a __host__ function("malloc") from a __global__ function("Init") is not allowed
How can I do this, please? (I know that malloc and other CPU memory-allocation functions are not allowed for device memory.)
malloc is allowed in device code but you have to be compiling for a cc2.0 or greater target GPU.
Adjust your VS project settings to remove any GPU device settings like compute_10,sm_10 and replace it with compute_20,sm_20 or higher to match your GPU. (And, to run that code, your GPU needs to be cc2.0 or higher.)
You need the compiler parameter -arch=sm_20 and a GPU which supports it.
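One related point worth noting, since the question allocates Data with in-kernel malloc: memory allocated with malloc() inside a kernel must also be freed with free() inside a kernel; it cannot be released with cudaFree(). A minimal cleanup sketch under that assumption (the kernel name and launch are illustrative, not from the question):
// Frees the per-thread Data arrays that Init allocated with in-kernel malloc().
__global__ void Cleanup(Center *V){
  free(V[threadIdx.x].Data);   // device free() pairs with device malloc()
  V[threadIdx.x].Data = NULL;
}

// Host side, once the data is no longer needed:
//   Cleanup<<<1,M>>>(V_dev);
//   cudaFree(V_dev);          // cudaFree releases only the array of Center structs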

Implementing Neural Network using CUDA

I am trying to create a neural network using CUDA.
My kernel looks like this:
__global__ void feedForward(float *input, float *output, float **weight) {
  //Here the threadId uniquely identifies a weight in a neuron
  int weightIndex = threadIdx.x;
  //Here the blockId uniquely identifies a neuron
  int neuronIndex = blockIdx.x;
  if(neuronIndex < NO_OF_NEURONS && weightIndex < NO_OF_WEIGHTS)
    output[neuronIndex] += weight[neuronIndex][weightIndex] * input[weightIndex];
}
While copying the output back to the host, I'm getting an error:
Error unspecified launch failure at line xx
At line xx :
CUDA_CHECK_RETURN(cudaMemcpy(h_output, d_Output, output_size, cudaMemcpyDeviceToHost));
Am I doing something wrong here?
Is it because of how I'm using both the block index and the thread index to reference the weight matrix?
Or does the problem lie elsewhere?
I'm allocating the weight matrix as follows:
cudaMallocPitch((void**)&d_Weight, &pitch_W, input_size, NO_OF_NEURONS);
My kernel call is:
feedForward<<<NO_OF_NEURONS,NO_OF_WEIGHTS>>>(d_Input,d_Output,d_Weight);
After that I call:
cudaThreadSynchronize();
I am new to programming with CUDA.
Any help would be appreciated.
Thanks
There is a problem in your output code. Though it won't produce the error described, it will produce incorrect results.
int neuronIndex = blockIdx.x;
if(neuronIndex<NO_OF_NEURONS && weightIndex<NO_OF_WEIGHTS)
output[neuronIndex] += weight[neuronIndex][weightIndex] * input[weightIndex];
We can see that all threads in a single block are writing concurrently into one memory cell, so undefined results are expected. To avoid this, I suggest reducing all values within a block in shared memory and performing a single write to global memory. Something like this:
__global__ void feedForward(float *input, float *output, float **weight) {
  int weightIndex = threadIdx.x;
  int neuronIndex = blockIdx.x;

  __shared__ float out_reduce[NO_OF_WEIGHTS];
  out_reduce[weightIndex] =
    (weightIndex < NO_OF_WEIGHTS && neuronIndex < NO_OF_NEURONS) ?
      weight[neuronIndex][weightIndex] * input[weightIndex] : 0.0f;
  __syncthreads();

  // standard tree reduction; assumes NO_OF_WEIGHTS is a power of two
  for (int s = NO_OF_WEIGHTS / 2; s > 0; s >>= 1)
  {
    if (weightIndex < s) out_reduce[weightIndex] += out_reduce[weightIndex + s];
    __syncthreads();
  }

  if (weightIndex == 0) output[neuronIndex] += out_reduce[0];
}
It turned out that I had to rewrite half of your small kernel to add the reduction code...
I built a very simple MLP network using CUDA. You can find my code over here if it interests you: https://github.com/PirosB3/CudaNeuralNetworks/
For any questions, just shoot!
Daniel
You're using cudaMallocPitch, but you don't show how the variables are initialized; I'd be willing to bet this is where your error stems from. cudaMallocPitch is rather tricky; the 3rd parameter should be in bytes, while the 4th parameter is not, i.e.:
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);
Is your variable input_size in bytes? If not, you might be allocating too little memory (i.e. you'll think you're requesting 64 elements, but instead you'll be getting 64 bytes), and as such you'll be accessing memory out of range in your kernel. In my experience, an "unspecified launch failure" error usually means I have a segfault.
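Since the pitch returned by cudaMallocPitch is in bytes, the kernel also has to use it when stepping between rows; here is a minimal sketch of that indexing pattern (names and the doubling operation are illustrative, not taken from the question):
__global__ void scaleRows(float *devPtr, size_t pitch, int width, int height)
{
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x < width && y < height) {
    // pitch is in bytes, so advance through rows via a char* before indexing columns
    float *row = (float *)((char *)devPtr + y * pitch);
    row[x] *= 2.0f;
  }
}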

cudaMemcpyDeviceToHost error in basic example

I have recently started learning CUDA and have integrated it into MS Visual Studio 2010 with Nsight. I have also acquired the book "CUDA by Example" and I'm going through all the examples and compiling them. However, I have come across an error which I do not understand.
The program comes from chapter 4 and it's the julia_gpu example. Original code:
#include "../common/book.h"
#include "../common/cpu_bitmap.h"
#define DIM 1000
struct cuComplex {
  float r;
  float i;
  cuComplex( float a, float b ) : r(a), i(b) {}
  __device__ float magnitude2( void ) {
    return r * r + i * i;
  }
  __device__ cuComplex operator*(const cuComplex& a) {
    return cuComplex(r*a.r - i*a.i, i*a.r + r*a.i);
  }
  __device__ cuComplex operator+(const cuComplex& a) {
    return cuComplex(r+a.r, i+a.i);
  }
};

__device__ int julia( int x, int y ) {
  const float scale = 1.5;
  float jx = scale * (float)(DIM/2 - x)/(DIM/2);
  float jy = scale * (float)(DIM/2 - y)/(DIM/2);

  cuComplex c(-0.8, 0.156);
  cuComplex a(jx, jy);

  int i = 0;
  for (i=0; i<200; i++) {
    a = a * a + c;
    if (a.magnitude2() > 1000)
      return 0;
  }
  return 1;
}

__global__ void kernel( unsigned char *ptr ) {
  // map from blockIdx to pixel position
  int x = blockIdx.x;
  int y = blockIdx.y;
  int offset = x + y * gridDim.x;

  // now calculate the value at that position
  int juliaValue = julia( x, y );
  ptr[offset*4 + 0] = 255 * juliaValue;
  ptr[offset*4 + 1] = 0;
  ptr[offset*4 + 2] = 0;
  ptr[offset*4 + 3] = 255;
}

// globals needed by the update routine
struct DataBlock {
  unsigned char *dev_bitmap;
};

int main( void ) {
  DataBlock data;
  CPUBitmap bitmap( DIM, DIM, &data );
  unsigned char *dev_bitmap;

  HANDLE_ERROR( cudaMalloc( (void**)&dev_bitmap, bitmap.image_size() ) );
  data.dev_bitmap = dev_bitmap;

  dim3 grid(DIM,DIM);
  kernel<<<grid,1>>>( dev_bitmap );

  HANDLE_ERROR( cudaMemcpy( bitmap.get_ptr(), dev_bitmap,
                            bitmap.image_size(),
                            cudaMemcpyDeviceToHost ) );
  HANDLE_ERROR( cudaFree( dev_bitmap ) );

  bitmap.display_and_exit();
}
My Visual Studio, however, forces me to mark the cuComplex constructor as __device__, otherwise it won't compile (it tells me I cannot use it later in the julia function), which I guess is fair enough. So I have:
__device__ cuComplex( float a, float b ) : r(a), i(b) {}
But when I run the example (having added the includes needed to build it under VS, namely cuda_runtime.h and device_launch_parameters.h, and copied glut32.dll into the same folder as the exe), it quickly fails, killing my device driver and reporting an unknown error at line 94, which is the cudaMemcpy call in main (the line containing cudaMemcpyDeviceToHost). To be frank, though, I have tried setting breakpoints line by line, and the driver actually dies at the kernel call.
Could someone please tell me what might be wrong? I am a noob with CUDA and have no real idea why a trivial example would kill itself like that. Frankly, I don't even know what to investigate.
I have the CUDA 4.1 toolkit, Nsight 2.1, and a GeForce GT 445M with compute capability 2.1, running version 295 of the drivers.
I haven't had time to test this yet, but I think it may be your graphics card "timing out" as far as Windows is concerned.
Since Vista, Windows has had a default behaviour of telling the graphics driver to reset if it does not respond within 2 seconds. If your kernel takes longer, you get booted. You can increase or remove this timeout through the registry. I assume you need a reboot for this, because I just made the changes and it's not working yet.
See this link for detail:
http://msdn.microsoft.com/en-us/windows/hardware/gg487368.aspx
...
Timeout Detection and Recovery: Windows Vista attempts to detect these problematic hang situations and recover a responsive desktop dynamically. In this process, the Windows Display Driver Model (WDDM) driver is reinitialized and the GPU is reset. No reboot is necessary, which greatly enhances the user experience. The only visible artifact from the hang detection to the recovery is a screen flicker, which results from resetting some portions of the graphics stack, causing a screen redraw. Some older Microsoft DirectX applications may render to a black screen at the end of this recovery. The end user would have to restart these applications. The following is a brief overview of the TDR process: ....
Clearly this is why it's a weird bug: it will give you that memcpy error at different scales for different people, depending on how fast their graphics card is.
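As a side note, here is a minimal sketch of how to confirm that the failure actually happens in the kernel rather than in the copy (HANDLE_ERROR is the book's macro; the two extra checks are additions, not part of the original example):
kernel<<<grid,1>>>( dev_bitmap );
HANDLE_ERROR( cudaGetLastError() );        // catches launch-configuration errors immediately
HANDLE_ERROR( cudaDeviceSynchronize() );   // surfaces errors raised while the kernel runs (e.g. a driver reset)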
This is a known issue in CUDA.
You can try changing this:
const float scale = 1.5;
to something larger like 3.5, 4.5, 5.5.
For example:
const float scale = 5.5;