CUDA Memcpy Device to Host: unspecified launch failure - cuda

This is a simple test program that I have been working on (to help with debugging my work on a running-sum function) and I just cannot seem to find what's wrong. The program simply calls my running-sum function on a small list and attempts to print out the data. The line that's causing all the trouble is the one that's commented out: the cudaMemcpy(DeviceToHost). When that line is part of the code, the error I get is:
CUDA error at: student_func.cu:136
unspecified launch failure cudaGetLastError()
terminate called after throwing an instance of 'thrust::system::system_error'
what(): unload of CUDA runtime failed
I simply do not know what's wrong with this and it's driving me insane. I tried using regular old malloc with the same result. I have confirmed that the input data gets copied over to the device array fine (by printing in the kernel), but I simply am not able to copy the results back from device to host. I would really appreciate any help whatsoever! Thanks in advance :)
unsigned int numElems = 100;
unsigned int blockLength = min( (unsigned int) 1024, (unsigned int) numElems);
unsigned int gridLength = ceil ( (float) numElems / (float) blockLength );
unsigned int* d_in;
unsigned int* h_in;
checkCudaErrors(cudaMallocHost(&h_in, sizeof(unsigned int) * numElems));
for (int i = 0; i < numElems; i++)
{
h_in[i] = i;
}
checkCudaErrors(cudaMalloc(&d_in, sizeof(unsigned int) * numElems));
checkCudaErrors(cudaMemcpy(d_in, h_in, sizeof(unsigned int) * numElems, cudaMemcpyHostToDevice));
exclusive_running_sum<<< gridLength, blockLength >>>(d_in, d_in, numElems);
cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());
//this line is a problem!!
//checkCudaErrors(cudaMemcpy(h_in, d_in, sizeof(unsigned int) * numElems, cudaMemcpyDeviceToHost));
for (int i = 0; i < numElems; i++)
{
printf("%i %i\n", i, h_in[i]);
}

Thanks to everyone for the help. I have found the bug. After much debugging, I realized that I (very, very foolishly) had forgotten that the kernel uses externally allocated (extern __shared__) shared memory.
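For anyone hitting the same thing: dynamically allocated shared memory declared with extern __shared__ must be sized via the third launch-configuration argument, otherwise the kernel touches unallocated shared memory and the failure surfaces at the next API call (here the cudaMemcpy back to the host). A minimal sketch of the corrected launch, assuming the kernel needs one unsigned int of shared storage per thread (the real size depends on the actual exclusive_running_sum kernel):

exclusive_running_sum<<< gridLength, blockLength, blockLength * sizeof(unsigned int) >>>(d_in, d_in, numElems);
cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());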

Related

CUDA/C - Using malloc in kernel functions gives strange results

I'm new to CUDA/C and new to stack overflow. This is my first question.
I'm trying to allocate memory dynamically in a kernel function, but the results are unexpected.
I've read that using malloc() in a kernel can lower performance a lot, but I need it anyway, so I first tried with a simple int ** array just to test the possibility; later I'll actually need to allocate more complex structs.
In my main I used cudaMalloc() to allocate the space for the array of int *, and then I used malloc() in the kernel so that every thread allocates the array for its index of the outer array. I then used a second kernel (with a single thread) to check the result, but it doesn't always work.
Here's the main code:
#define N_CELLE 1024*2
#define L_CELLE 512
extern "C" {
int main(int argc, char **argv) {
int *result = (int *)malloc(sizeof(int));
int *d_result;
int size_numbers = N_CELLE * sizeof(int *);
int **d_numbers;
cudaMalloc((void **)&d_numbers, size_numbers);
cudaMalloc((void **)&d_result, sizeof(int *));
kernel_one<<<2, 1024>>>(d_numbers);
cudaDeviceSynchronize();
kernel_two<<<1, 1>>>(d_numbers, d_result);
cudaMemcpy(result, d_result, sizeof(int), cudaMemcpyDeviceToHost);
printf("%d\n", *result);
cudaFree(d_numbers);
cudaFree(d_result);
free(result);
}
}
I used extern "C"because I could't compile while importing my header, which is not used in this example code. I pasted it since I don't know if this may be relevant or not.
This is kernel_one code:
__global__ void kernel_one(int **d_numbers) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    d_numbers[i] = (int *)malloc(L_CELLE*sizeof(int));
    for(int j=0; j<L_CELLE;j++)
        d_numbers[i][j] = 1;
}
And this is kernel_two code:
__global__ void kernel_two(int **d_numbers, int *d_result) {
    int temp = 0;
    for(int i=0; i<N_CELLE; i++) {
        for(int j=0; j<L_CELLE;j++)
            temp += d_numbers[i][j];
    }
    *d_result = temp;
}
Everything works fine (i.e. the count is correct) as long as I allocate no more than 1024*2*512 ints in total via in-kernel malloc. For example, if I #define N_CELLE 1024*4 the program starts giving "random" results, such as negative numbers.
Any idea of what the problem could be?
Thanks anyone!
In-kernel memory allocation draws memory from a statically allocated runtime heap. At larger sizes, you are exceeding the size of that heap and then your two kernels are attempting to read and write from uninitialised memory. This produces a runtime error on the device and renders the results invalid. You would already know this if you either added correct API error checking on the host side, or ran your code with the cuda-memcheck utility.
The solution is to ensure that the heap size is set to something appropriate before trying to run a kernel. Adding something like this:
size_t heapsize = sizeof(int) * size_t(N_CELLE) * size_t(2*L_CELLE);
cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapsize);
to your host code before any other API calls, should solve the problem.
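If you want to confirm that the limit was applied, you can also read it back with cudaDeviceGetLimit. A small hedged sketch (the error handling here is purely illustrative):

size_t heapsize = sizeof(int) * size_t(N_CELLE) * size_t(2*L_CELLE);
cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapsize);
if (err != cudaSuccess)
    printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
size_t actual = 0;
cudaDeviceGetLimit(&actual, cudaLimitMallocHeapSize);  // read the limit back
printf("device malloc heap size: %zu bytes\n", actual);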
I don't know anything about CUDA but these are severe bugs:
You cannot convert from int** to void**. They are not compatible types. Casting doesn't solve the problem, but hides it.
&d_numbers gives the address of a pointer to pointer which is wrong. It is of type int***.
Both of the above bugs result in undefined behavior. If your program somehow seems to work under some conditions, that's just by pure (bad) luck.

argument of type "int *" is incompatible with parameter of type "int" in cuda kernel call

I've been trying for a while and have come across seemingly similar issues already posted; however, for some reason I'm still failing to clear the error. I effectively want to pass a 2D matrix to the kernel as a 1D array, as I have seen suggested. I'm not sure where I've gone wrong in my syntax, but there is a clash between the arguments I supply to the kernel and the parameters the kernel expects.
__global__ void calculatePath(int source, int target, int *cost, int distance){
    int t_id = blockIdx.x * blockDim.x + threadIdx.x;
    int dist[50];
    int prev[50];
    int selected[50]={0};
    int num_path[50];
    int d, m, min, start, j;
    if ((t_id > 0) && (t_id < N)){
        dist[t_id] = IN;
        prev[t_id] = -1;
    }
This is my kernel function, whose parameters are all integers except "cost", which is a pointer to an integer array.
int main(int argc, char **argv){
int h_num_path[N];
int h_distance = 0;
int h_cost[N][N],i,j,co;
int h_source;
int h_target;
printf("\tShortest Path Algorithm(DIJKSRTRA's ALGORITHM\n\n");
for(i=0;i< N;i++)
for(j=0;j< N;j++)
h_cost[i][j] = IN;
//*********************
srand ( time(NULL));
for(int x=1;x< N;x++) {
for (int y = x + 1; y < N; y++) {
h_cost[x][y] = h_cost[y][x] = (rand() % 100) + 1;
}
}
printf("\nEnter The Source: ");
scanf("%d", &h_source);
printf("\nEnter The target: ");
scanf("%d", &h_target);
int *d_num_path;
int *d_cost;
int *d_source;
int *d_target;
int *d_dist;
int *d_prev;
int *d_distance;
cudaMalloc(&d_num_path, sizeof(int)*N);
cudaMalloc(&d_cost, sizeof(int)*N*N);
cudaMalloc((void**) &d_source, sizeof(int));
cudaMalloc((void**) &d_target, sizeof(int));
cudaMalloc((void**) &d_dist, sizeof(int)*N);
cudaMalloc((void**) &d_distance, sizeof(int));
cudaMemcpy(d_source, &h_source, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_target, &h_target, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_cost, h_cost, sizeof(int)*N*N, cudaMemcpyHostToDevice);
cudaMemcpy(d_distance, &h_distance, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_num_path, &h_num_path, sizeof(int)*N, cudaMemcpyHostToDevice);
clock_t before;
before = clock();
calculatePath<<<N/512 + 1, 512>>>(d_source, d_target, d_cost, d_distance);
clock_t time_taken = clock() - before;
cudaMemcpy(&h_num_path, d_num_path, sizeof(int)*N, cudaMemcpyDeviceToHost);
cudaMemcpy(&h_distance, d_distance, sizeof(int), cudaMemcpyDeviceToHost);
cudaFree(d_num_path);
cudaFree(d_cost);
cudaFree(d_source);
cudaFree(d_target);
cudaFree(d_dist);
cudaFree(d_prev);
cudaFree(d_distance);
printf("\nShortest Path: %d \n",co);
printf("%s %.4f %s", "Time taken:", time_taken/1000.0, "seconds");
return 0;
}
On the kernel call, however, I receive the error "argument of type 'int *' is incompatible with parameter of type 'int'", yet I believe my d_cost already is a pointer. I'd appreciate being set straight, as I'm sure I'm overlooking something small.
It is not d_cost you are having trouble with. The other three arguments (d_source, d_target, and d_distance) are int*, but the corresponding parameters are declared as plain int.
The C Programming Language by K&R at page 25 says:
We will generally use parameter for a variable named in the parenthesized list in a function definition, and argument for the value used in a call of the function.
Since your source and target are just single integer values, you don't really need to define device-side variables for them. Just pass the integer values themselves as arguments. By doing so, you'll also get a performance improvement, as talonmies commented:
(With pass by value) there is constant memory cache broadcast within the kernel if it is done that way. Passing pointers for simple constants just increases latency by forcing every thread to dereference the pointer to retrieve the value from global memory, plus all the additional host side memory APIs to allocate them in the first place.
Also, you seem to expect the parameter distance to carry an output value of your kernel; in that case it must be declared as a pointer, so you can copy it back with cudaMemcpyDeviceToHost after the kernel.
__global__ void calculatePath(int source, int target, int *cost, int *distance) // kernel definition
calculatePath<<< (N + 511) / 512, 512 >>>(h_source, h_target, d_cost, d_distance); // kernel launch
Alternatively: three of your kernel parameters are declared as plain int, but you are passing pointers to int. You could instead change the kernel signature so that every parameter is a pointer (and then dereference source, target, and distance inside the kernel, as sketched after the signature below):
__global__ void calculatePath(int *source, int *target, int *cost, int *distance)
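With the all-pointer signature, the scalar parameters have to be dereferenced before use. A minimal hedged sketch (the local variable names src and tgt are illustrative, not from the original code):

__global__ void calculatePath(int *source, int *target, int *cost, int *distance){
    int t_id = blockIdx.x * blockDim.x + threadIdx.x;
    int src = *source;  // read the scalar values out of global memory once
    int tgt = *target;
    // ... use src and tgt in place of source and target below ...
}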

How to write a pointer-chasing benchmark using 64-bit pointers in CUDA?

This research paper runs a series of CUDA microbenchmarks on a GPU to obtain statistics like global memory latency, instruction throughput, etc. This link points to the set of microbenchmarks that the authors wrote and ran on their GPU.
One of the microbenchmarks called global.cu gives the code for a pointer-chasing benchmark to measure global memory latency.
This is the code of the kernel that is run.
__global__ void global_latency (unsigned int ** my_array, int array_length, int iterations, int ignore_iterations, unsigned long long * duration) {
    unsigned int start_time, end_time;
    unsigned int *j = (unsigned int*)my_array;
    volatile unsigned long long sum_time;
    sum_time = 0;
    duration[0] = 0;
    for (int k = -ignore_iterations; k < iterations; k++) {
        if (k==0) {
            sum_time = 0; // ignore some iterations: cold icache misses
        }
        start_time = clock();
        repeat256(j=*(unsigned int **)j;) // unroll macro, simply creates an unrolled loop of 256 instructions, nothing more
        end_time = clock();
        sum_time += (end_time - start_time);
    }
    ((unsigned int*)my_array)[array_length] = (unsigned int)j;
    ((unsigned int*)my_array)[array_length+1] = (unsigned int) sum_time;
    duration[0] = sum_time;
}
The line of code performing the pointer chasing in the case of 32-bit pointers is:
j = *(unsigned int**)j;
This is the key line, because the remaining lines of code are only used for time measurement.
I tried to run this on my GPU, but I faced an issue. Running the same microbenchmark with no changes gives me the runtime error "an illegal memory access was encountered".
In the same link they explain that:
The global memory tests use pointer chasing code where the pointer values are stored in an array. Pointers on GT200 are 32 bits. The global memory test will need to be changed if the pointer size changes, e.g., 64-bit pointers on Fermi.
It turns out that my GPU is of Kepler architecture, which has 64-bit pointers.
How do I modify that bit of pointer-chasing code which originally deals with 32-bit pointers, in order to measure global memory latency using 64-bit pointers?
Edit:
From havogt's answer: an important piece of information that I should have included in the question is this portion of the code, which builds an array of memory locations in which each entry holds the address of the next entry to visit.
for (i = 0; i < N; i += step) {
    // Device pointers are 32-bit on GT200.
    h_a[i] = ((unsigned int)(uintptr_t)d_a) + ((i + stride) % N)*sizeof(unsigned int);
}
Introduction
Before I explain what you have to do to make the code work, let me emphasize the following: you should have a very good understanding of the hardware you are testing and the design of your microbenchmark. Why is it important? The original code was designed for the GT200, which did not have a cache for ordinary global memory loads. If you now just fix the pointer problem, you will basically measure the L2 latency (on Kepler, where by default L1 is not used), because the original code uses a very small amount of memory which fits nicely into the cache.
Disclaimer: this is also the first time I have studied such benchmarking code. Therefore, check carefully before you use the code below. I do not guarantee that I did not make mistakes when transforming the original code.
The simple solution (measures basically the cache latency)
First, you did not include all relevant parts of the code in your question. The most important part is
for (i = 0; i < N; i += step) {
    // Device pointers are 32-bit on GT200.
    h_a[i] = ((unsigned int)(uintptr_t)d_a) + ((i + stride) % N)*sizeof(unsigned int);
}
which builds an array of memory locations in which each entry holds the address of the next entry to visit.
Now all you need to do is replace all unsigned int (which is used for storing the 32-bit pointers) by unsigned long long int, both in the setup code and in the kernel.
I won't post the code, since I cannot recommend running such code if you don't understand it (see the introduction above). If you understand it, then it is simple.
My solution
Basically, what I did is to use as much memory as is needed to evaluate all the pointers, up to a maximum of 1 GB. In both cases I wrapped the last entry back to the first entry. Note that, depending on the stride, a lot of array entries may be uninitialized (because they are never used).
The following code is basically the original code after a bit of clean-up (but it's still not very clean, sorry...) and the change in the memory handling. I introduced a typedef
typedef unsigned long long int ptrsize_type;
to highlight at which locations the unsigned int from the original code has to be replaced with unsigned long long int. I used the repeat1024 macro (from the original code), which just repeats the line j=*(ptrsize_type **)j; 1024 times.
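For readers who do not have repeat.h at hand: such an unroll macro is typically built by doubling a smaller macro. A hedged sketch of the idea (this is a hypothetical reconstruction, not the authors' actual header, which may differ):

#define repeat2(S)    S S
#define repeat4(S)    repeat2(S) repeat2(S)
#define repeat8(S)    repeat4(S) repeat4(S)
#define repeat16(S)   repeat8(S) repeat8(S)
#define repeat32(S)   repeat16(S) repeat16(S)
#define repeat64(S)   repeat32(S) repeat32(S)
#define repeat128(S)  repeat64(S) repeat64(S)
#define repeat256(S)  repeat128(S) repeat128(S)
#define repeat512(S)  repeat256(S) repeat256(S)
#define repeat1024(S) repeat512(S) repeat512(S)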
The strides can be adjusted in measure_global_latency(). In the output the stride is given in bytes.
I leave the interpretation of the latency for the different strides to you. The strides need to be adjusted such that you do not reuse the cache!
#include <stdio.h>
#include <stdint.h>
#include "repeat.h"
typedef unsigned long long int ptrsize_type;
__global__ void global_latency (ptrsize_type** my_array, int array_length, int iterations, unsigned long long * duration) {
    unsigned long long int start_time, end_time;
    ptrsize_type *j = (ptrsize_type*)my_array;
    volatile unsigned long long int sum_time;
    sum_time = 0;
    for (int k = 0; k < iterations; k++)
    {
        start_time = clock64();
        repeat1024(j=*(ptrsize_type **)j;)
        end_time = clock64();
        sum_time += (end_time - start_time);
    }
    ((ptrsize_type*)my_array)[array_length] = (ptrsize_type)j;
    ((ptrsize_type*)my_array)[array_length+1] = (ptrsize_type) sum_time;
    duration[0] = sum_time;
}
void parametric_measure_global(int N, int iterations, unsigned long long int maxMem, int stride)
{
    unsigned long long int maxMemToArraySize = maxMem / sizeof( ptrsize_type );
    unsigned long long int maxArraySizeNeeded = 1024*iterations*stride;
    unsigned long long int maxArraySize = (maxMemToArraySize<maxArraySizeNeeded)?(maxMemToArraySize):(maxArraySizeNeeded);
    ptrsize_type* h_a = new ptrsize_type[maxArraySize+2];
    ptrsize_type** d_a;
    cudaMalloc ((void **) &d_a, (maxArraySize+2)*sizeof(ptrsize_type));
    unsigned long long int* duration;
    cudaMalloc ((void **) &duration, sizeof(unsigned long long int));
    for ( int i = 0; true; i += stride)
    {
        ptrsize_type nextAddr = ((ptrsize_type)d_a)+(i+stride)*sizeof(ptrsize_type);
        if( i+stride < maxArraySize )
        {
            h_a[i] = nextAddr;
        }
        else
        {
            h_a[i] = (ptrsize_type)d_a; // point back to the first entry
            break;
        }
    }
    cudaMemcpy((void *)d_a, h_a, (maxArraySize+2)*sizeof(ptrsize_type), cudaMemcpyHostToDevice);
    unsigned long long int latency_sum = 0;
    int repeat = 1;
    for (int l=0; l <repeat; l++)
    {
        global_latency<<<1,1>>>(d_a, maxArraySize, iterations, duration);
        cudaThreadSynchronize ();
        cudaError_t error_id = cudaGetLastError();
        if (error_id != cudaSuccess)
        {
            printf("Error is %s\n", cudaGetErrorString(error_id));
        }
        unsigned long long int latency;
        cudaMemcpy( &latency, duration, sizeof(unsigned long long int), cudaMemcpyDeviceToHost);
        latency_sum += latency;
    }
    cudaFree(d_a);
    cudaFree(duration);
    delete[] h_a;
    printf("%f\n", (double)(latency_sum/(repeat*1024.0*iterations)) );
}
void measure_global_latency()
{
    int maxMem = 1024*1024*1024; // 1GB
    int N = 1024;
    int iterations = 1;
    for (int stride = 1; stride <= 1024; stride+=1)
    {
        printf (" %5d, ", (int)(stride*sizeof( ptrsize_type )));
        parametric_measure_global( N, iterations, maxMem, stride );
    }
    for (int stride = 1024; stride <= 1024*1024; stride+=1024)
    {
        printf (" %5d, ", (int)(stride*sizeof( ptrsize_type )));
        parametric_measure_global( N, iterations, maxMem, stride );
    }
}
int main()
{
    measure_global_latency();
    return 0;
}
Edit:
Some more details in response to the comments: I did not include an interpretation of the results because I do not consider myself an expert on such benchmarks. It was not my intent to make the interpretation an exercise for the reader.
Now here is my interpretation: I get the same results for Kepler GPUs (with L1 not available/disabled). Something below 200 cycles for an L2 read is what you get with a small stride. The accuracy can be improved by increasing the iterations variable to definitely reuse L2.
The tricky task is now to find a stride that does not reuse the L2 cache. In my approach I just blindly try many different (large) strides and hope that L2 is not reused. There, I also get something around ~500 cycles. Of course, the better approach would be to think more about the structure of the cache and deduce the correct stride by reasoning rather than by trial and error. That's the main reason why I didn't want to interpret the result myself.
Why is the latency decreasing again for strides > 1 MB? The reason for this behaviour is that I used a fixed size of 1 GB for the maximal memory usage. With the 1024 pointer lookups (repeat1024), a stride of 1 MB just fits in that memory. Larger strides will wrap around and again use data from the L2 cache. The main problem with the current code is that the 1024 pointers (1024 × 64 bit) still fit perfectly into the L2 cache.
This introduces another trap: if you set the number of iterations to something > 1 and 1024*iterations*stride*sizeof(ptrsize_type) exceeds the memory limit, you will again use the L2 cache. A small host-side check, sketched below, can guard against that.
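This is a hedged, illustrative guard (not part of the original code) that could be placed in parametric_measure_global before the array is filled:

// Warn when the requested footprint exceeds maxMem, i.e. when the chase
// wraps around and may start hitting the L2 cache again.
unsigned long long int footprint =
    1024ULL * iterations * stride * sizeof(ptrsize_type);
if (footprint > maxMem)
{
    printf("warning: footprint %llu B exceeds maxMem %llu B; "
           "the pointer chase wraps around and may reuse L2\n",
           footprint, maxMem);
}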
Possible solution:
Instead of wrapping the last entry to the first element, one should implement a smarter wrapping to an (unused!) location which is between the size of the cache-line and the stride. But you need to be very careful that you do not overwrite memory locations, especially if you are wrapping around multiple times.

CUDA: sum-reduction --- data lost in call to device function [duplicate]

I'm aware that there are multiple questions similar to this one already answered but I've been unable to piece together anything very helpful from them other than that I'm probably incorrectly indexing something.
I'm trying to perform a sequential addressing reduction on input vector A into output vector B.
The full code is available here http://pastebin.com/7UGadgjX, but this is the kernel:
__global__ void vectorSum(int *A, int *B, int numElements) {
    extern __shared__ int S[];
    // Each thread loads one element from global to shared memory
    int tid = threadIdx.x;
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) {
        S[tid] = A[i];
        __syncthreads();
        // Reduce in shared memory
        for (int t = blockDim.x/2; t > 0; t>>=1) {
            if (tid < t) {
                S[tid] += S[tid + t];
            }
            __syncthreads();
        }
        if (tid == 0) B[blockIdx.x] = S[0];
    }
}
and these are the kernel launch statements:
// Launch the Vector Summation CUDA Kernel
int threadsPerBlock = 256;
int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
vectorSum<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, numElements);
I'm getting an unspecified launch failure, which I've read is similar to a segfault. I've been following the NVIDIA reduction documentation closely and have tried to keep my kernel within the bounds of numElements, but I seem to be missing something key, considering how simple the code is.
Your problem is that the reduction kernel requires dynamically allocated shared memory to operate correctly, but your kernel launch doesn't specify any. The result is out-of-bounds/illegal shared memory accesses, which abort the kernel.
In CUDA runtime API syntax, the kernel launch statement has four arguments. The first two are the grid and block dimensions for the launch. The latter two are optional with zero default values, but specify the dynamically allocated shared memory size and stream.
To fix this, change the launch code as follows:
// Launch the Vector Summation CUDA Kernel
int threadsPerBlock = 256;
int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
size_t shmsz = (size_t)threadsPerBlock * sizeof(int);
vectorSum<<<blocksPerGrid, threadsPerBlock, shmsz>>>(d_A, d_B, numElements);
[disclaimer: code written in browser, not compiled or tested, use at own risk]
This should at least fix the most obvious problem with your code.
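Note that the kernel writes one partial sum per block into B, so a second step is still needed to combine those partials. A minimal hedged sketch of a host-side finish (h_B is assumed to be a host buffer of at least blocksPerGrid ints; it is not part of the code shown above):

// Copy the per-block partial sums back and add them up on the host.
cudaMemcpy(h_B, d_B, blocksPerGrid * sizeof(int), cudaMemcpyDeviceToHost);
int total = 0;
for (int b = 0; b < blocksPerGrid; ++b)
    total += h_B[b];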

Implementing Neural Network using CUDA

I am trying to create a Neural Network using CUDA:
My kernel looks like :
__global__ void feedForward(float *input, float *output, float **weight) {
    // Here the threadId uniquely identifies a weight within a neuron
    int weightIndex = threadIdx.x;
    // Here the blockId uniquely identifies a neuron
    int neuronIndex = blockIdx.x;
    if(neuronIndex<NO_OF_NEURONS && weightIndex<NO_OF_WEIGHTS)
        output[neuronIndex] += weight[neuronIndex][weightIndex]
                             * input[weightIndex];
}
While copying the output back to host, I'm getting an error
Error unspecified launch failure at line xx
At line xx :
CUDA_CHECK_RETURN(cudaMemcpy(h_output, d_Output, output_size, cudaMemcpyDeviceToHost));
Am I doing something wrong here?
Is it because of how I'm using both the block index and the thread index to reference the weight matrix? Or does the problem lie elsewhere?
I'm allocating the weight matrix as follows:
cudaMallocPitch((void**)&d_Weight, &pitch_W, input_size, NO_OF_NEURONS);
My kernel call is:
feedForward<<<NO_OF_NEURONS,NO_OF_WEIGHTS>>>(d_Input,d_Output,d_Weight);
After that I call:
cudaThreadSynchronize();
I am new to programming with CUDA.
Any help would be appreciated.
Thanks
There is a problem in the output code. Though it won't produce the error described, it will produce incorrect results.
int neuronIndex = blockIdx.x;
if(neuronIndex<NO_OF_NEURONS && weightIndex<NO_OF_WEIGHTS)
output[neuronIndex] += weight[neuronIndex][weightIndex] * input[weightIndex];
We can see that all threads in a single block are writing concurrently into one memory cell, so undefined results are expected. To avoid this, I suggest reducing all the values within a block in shared memory and performing a single write to global memory. Something like this:
__global__ void feedForward(float *input, float *output, float **weight) {
    int weightIndex = threadIdx.x;
    int neuronIndex = blockIdx.x;
    __shared__ float out_reduce[NO_OF_WEIGHTS];
    out_reduce[weightIndex] =
        (weightIndex<NO_OF_WEIGHTS && neuronIndex<NO_OF_NEURONS) ?
            weight[neuronIndex][weightIndex] * input[weightIndex]
            : 0.0;
    __syncthreads();
    // assumes NO_OF_WEIGHTS is a power of two; start at half the block size
    for (int s = NO_OF_WEIGHTS / 2; s > 0 ; s >>= 1)
    {
        if (weightIndex < s) out_reduce[weightIndex] += out_reduce[weightIndex + s];
        __syncthreads();
    }
    if (weightIndex == 0) output[neuronIndex] += out_reduce[weightIndex];
}
It turned out that I had to rewrite half of your small kernel to add the reduction code...
I built a very simple MLP network using CUDA. You can find my code over here if it interests you: https://github.com/PirosB3/CudaNeuralNetworks/
For any questions, just shoot!
Daniel
You're using cudaMallocPitch, but don't show how the variables are initialized; I'd be willing to bet this is where your error stems from. cudaMallocPitch is rather tricky; the 3rd parameter should be in bytes, while the 4th parameter is not. i.e.
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch((void**)&devPtr, &pitch, width * sizeof(float), height);
Is your variable input_size in bytes? If not, then you might be allocating too little memory (i.e. you'll think you're requesting 64 elements, but instead you'll be getting 64 bytes), and as such you'll be accessing memory out of range in your kernel. In my experience, an "unspecified launch failure" error usually means I have a segfault somewhere.
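Also note that a pitched allocation is a single linear buffer, not a float**, so the kernel would take a float* plus the pitch and step between rows by a byte offset. A hedged sketch of the standard pitched-access pattern (adapted to the names from the question; it keeps the original accumulation for brevity, so the reduction fix above still applies):

__global__ void feedForward(float *input, float *output, float *weight, size_t pitch) {
    int weightIndex = threadIdx.x;
    int neuronIndex = blockIdx.x;
    if (neuronIndex < NO_OF_NEURONS && weightIndex < NO_OF_WEIGHTS) {
        // Advance to row 'neuronIndex' using the pitch in bytes, then index the column.
        float *row = (float*)((char*)weight + neuronIndex * pitch);
        output[neuronIndex] += row[weightIndex] * input[weightIndex];
    }
}

The corresponding launch would pass d_Weight and pitch_W, and the host-side weights would be copied over with cudaMemcpy2D.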