CUDA: sum-reduction --- data lost in call to device function [duplicate] - cuda

I'm aware that there are multiple questions similar to this one already answered but I've been unable to piece together anything very helpful from them other than that I'm probably incorrectly indexing something.
I'm trying to preform a sequential addressing reduction on input vector A into output vector B.
The full code is available here http://pastebin.com/7UGadgjX, but this is the kernel:
__global__ void vectorSum(int *A, int *B, int numElements) {
extern __shared__ int S[];
// Each thread loads one element from global to shared memory
int tid = threadIdx.x;
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < numElements) {
S[tid] = A[i];
__syncthreads();
// Reduce in shared memory
for (int t = blockDim.x/2; t > 0; t>>=1) {
if (tid < t) {
S[tid] += S[tid + t];
}
__syncthreads();
}
if (tid == 0) B[blockIdx.x] = S[0];
}
}
and these are the kernel launch statements:
// Launch the Vector Summation CUDA Kernel
int threadsPerBlock = 256;
int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
vectorSum<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, numElements);
I'm getting a unspecified launch error which I've read is similar to a segfault. I've been following the nvidia reduction documentation closely and tried to keep my kernel within the bounds of numElements but I seem to be missing something key considering how simple the code is.

Your problem is that the reduction kernel requires dynamically allocated shared memory to operate correctly, but your kernel launch doesn't specify any. The result is out of bounds/illegal shared memory access which aborts the kernel.
In CUDA runtime API syntax, the kernel launch statement has four arguments. The first two are the grid and block dimensions for the launch. The latter two are optional with zero default values, but specify the dynamically allocated shared memory size and stream.
To fix this, change the launch code as follows:
// Launch the Vector Summation CUDA Kernel
int threadsPerBlock = 256;
int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
size_t shmsz = (size_t)threadsPerBlock * sizeof(int);
vectorSum<<<blocksPerGrid, threadsPerBlock, shmsz>>>(d_A, d_B, numElements);
[disclaimer: code written in browser, not compiled or tested, use at own risk]
This should at least fix the most obvious problem with your code.

Related

cuda kernel failed when block number smaller than max number

I have write a CUDA function same as cublasSdgmm in CUBLAS, and I find when I increase block number, function performance may be poorer, or even failed.
Here is the code, M = 9.6e6, S = 3, the best performance block number is 320, my GPU is GTX960, and the max block size is 2147483647 in X dimension.
__global__ void DgmmKernel(float *d_y, float *d_r, int M, int S){
int row = blockIdx.x*blockDim.x + threadIdx.x;
int col = blockIdx.y*blockDim.y + threadIdx.y;
while(row < M){
d_y[row + col * M] *= d_r[row];
row += blockDim.x * gridDim.x;
}
}
void Dgmm(float *d_y, float *d_r, int M, int S){
int xthreads_per_block = 1024;
dim3 dimBlock(xthreads_per_block, 1);
dim3 dimGrid(320, S);
DgmmKernel<<<dimBlock, dimGrid>>>(d_y, d_r, M, S);
}
I guess the reason is that there may be a resource limit in GPU, is it right?
If it is right, what specific resource limits the performance, the kernel function just reads two vectors and do a multiplication operation. And is there any method to improve performance on my GPU.
You have the block and grid dimension arguments reversed in your kernel launch, and your kernel should never be running. You should do something like this:
dim3 dimBlock(xthreads_per_block, 1);
dim3 dimGrid(320, S);
DgmmKernel<<<dimGrid, dimBlock>>>(d_y, d_r, M, S);
If your code contained appropriate runtime error checking, you would already be aware that the kernel launch is failing with an invalid configuration error for any value of S>3.

Implementing Neural Network using CUDA

I am trying to create a Neural Network using CUDA:
My kernel looks like :
__global__ void feedForward(float *input, float *output, float **weight) {
//Here the threadId uniquely identifies weight in a neuron
int weightIndex = threadIdx.x;
//Here the blockId uniquely identifies a neuron
int neuronIndex = blockIdx.x;
if(neuronIndex<NO_OF_NEURONS && weightIndex<NO_OF_WEIGHTS)
output[neuronIndex] += weight[neuronIndex][weightIndex]
* input[weightIndex];
}
While copying the output back to host, I'm getting an error
Error unspecified launch failure at line xx
At line xx :
CUDA_CHECK_RETURN(cudaMemcpy(h_output, d_Output, output_size, cudaMemcpyDeviceToHost));
Am I doing something wrong here?
Is it because of how I'm using both the block index as well as thread index to reference the weight matrix.
Or does the problem lie elsewhere ?
I'm allcoating the weight matrix as follows:
cudaMallocPitch((void**)&d_Weight, &pitch_W,input_size,NO_OF_NEURONS);
My kernel call is:
feedForward<<<NO_OF_NEURONS,NO_OF_WEIGHTS>>>(d_Input,d_Output,d_Weight);
After that i call:
cudaThreadSynchronize();
I am new to programming with CUDA.
Any help would be appreciated.
Thanks
There is a problem in output code. Though it won't produce the error described, it will produce incorrect results.
int neuronIndex = blockIdx.x;
if(neuronIndex<NO_OF_NEURONS && weightIndex<NO_OF_WEIGHTS)
output[neuronIndex] += weight[neuronIndex][weightIndex] * input[weightIndex];
We can see that all threads in single block are writing concurrently into one memory cell. So udefined results are expected. To avoid this I suggest reduce all values within a block in shared memory and perform a single write to global memory. Something like this:
__global__ void feedForward(float *input, float *output, float **weight) {
int weightIndex = threadIdx.x;
int neuronIndex = blockIdx.x;
__shared__ float out_reduce[NO_OF_WEIGHTS];
out_reduce[weightIndex] =
(weightIndex<NO_OF_WEIGHTS && neuronIndex<NO_OF_NEURONS) ?
weight[neuronIndex][weightIndex] * input[weightIndex]
: 0.0;
__syncthreads();
for (int s = NO_OF_WEIGHTS; s > 0 ; s >>= 1)
{
if (weightIndex < s) out_reduce[weightIndex] += out_reduce[weightIndex + s];
__syncthreads();
}
if (weightIndex == 0) output[neuronIndex] += out_reduce[weightIndex];
}
It turned out that I had to rewrite half of you small kernel to help with reduction code...
I build a very simple MLP network using CUDA. You can find my code over here if it may interest you: https://github.com/PirosB3/CudaNeuralNetworks/
For any questions, just shoot!
Daniel
You're using cudaMallocPitch, but don't show how the variables are initialized; I'd be willing to bet this is where your error stems from. cudaMallocPitch is rather tricky; the 3rd parameter should be in bytes, while the 4th parameter is not. i.e.
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&device_Ptr, &pitch, width * sizeof(float), height);
Is your variable input_size in bytes? If not, then you might be allocating too little memory (i.e. you'll think you're requesting 64 elements, but instead you'll be getting 64 bytes), and as such you'll be accessing memory out of range in your kernel. In my experience, an "unspecified launch failure" error usually means I have a segfault

Can multiple blocks and threads write to the same output?

I have the following CUDA kernel code which computes the sum squared error of two arrays.
__global__ void kSquaredError(double* data, double* recon, double* error,
unsigned int num_elements)
{
const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
for (unsigned int i = idx; i < num_elements; i += blockDim.x * gridDim.x) {
*error += pow(data[i] - recon[i], 2);
}
}
I need a single scalar output (error). In this case, it seems like all threads are writing to error simultaneously. Is there some way I need to synchronize it?
Currently I'm getting a bad result so I'm guessing there is some issue.
The implementation you are doing now is subject to race conditions due to the fact that all threads try to update the same global memory address at the same time. You could easily put a atomicAdd function instead of *error += pow... but that suffers from performance issues due to it being serialized on each update.
Instead you should try and and do a reduction using the shared memory, as following:
_global__ void kSquaredError(double* data, double* recon, double* error, unsigned int num_elements) {
const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
const unsigned int tid = threadIdx.x;
__shared__ double serror[blockDim.x];//temporary storage of each threads error
for (unsigned int i = idx; i < num_elements; i += blockDim.x * gridDim.x) {
serror[tid] += pow(data[i] - recon[i], 2);//put each threads value in shared memory
}
__syncthreads();
int i = blockDim.x >> 1; //halve the threads
for(;i>0;i>>=1) {//reduction in shared memory
if(tid<i) {
serror[tid] += serror[tid+i];
__syncthreads();//make shure all threads are at the same state and have written to shared memory
}
}
if(tid == 0) {//thread 0 updates the value in global memory
atomicAdd(error,serror[tid]);// same as *error += serror[tid]; but atomic
}
}
It works by the following principle, each thread have its own temporary variable where it calculates the sum of the error for all its input, when it have finished all threads converge at the __syncthreads instruction to ensure that all data is complete.
Now half of all the threads in the block will take one value from the corresponding other half add add it to its own, half the threads again and do it again until you are left with one thread(thread 0) which will have the total sum.
Now thread 0 will uppdate the global memory with an atomicAdd function to avoid race condition with other blocks if there is any.
If we would just use the first example and use atomicAdd on every assignment. You would have gridDim.x*blockDim.x*num_elements atomic functions that would be serialized, now we have only gridDim.x atomic functions which is a lot less.
See Optimizing Parallel Reduction in CUDA for further reading on how reduction using cuda works.
Edit
Added if statement in the reduction for loop, forgot that.

CUDA: bad performance with shared memory and no parallelism

I'm trying to exploit shared memory in this kernel function, but the performance are not as good as I was expecting. This function is called many times in my application (about 1000 times or more), so I was thinking to exploit shared memory to avoid the memory latency. But something is wrong apparently because my application became really slow since i'm using shared memory.
This is the kernel:
__global__ void AndBitwiseOperation(int* _memory_device, int b1_size, int* b1_memory, int* b2_memory){
int j = 0;
// index GPU - Transaction-wise
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int tid = threadIdx.x;
// shared variable
extern __shared__ int shared_memory_data[];
extern __shared__ int shared_b1_data[];
extern __shared__ int shared_b2_data[];
// copy from global memory into shared memory and sync threads
shared_b1_data[tid] = b1_memory[tid];
shared_b2_data[tid] = b2_memory[tid];
__syncthreads();
// AND each int bitwise
for(j = 0; j < b1_size; j++)
shared_memory_data[tid] = (shared_b1_data[tid] & shared_b2_data[tid]);
// write result for this block to global memory
_memory_device[i] = shared_memory_data[i];
}
The shared variables are declared extern because I don't know the size of b1 and b2 since they depend from the number of customer that I can only know at runtime (but both have the same size all the times).
This is how I call the kernel:
void Bitmap::And(const Bitmap &b1, const Bitmap &b2)
{
int* _memory_device;
int* b1_memory;
int* b2_memory;
int b1_size = b1.getIntSize();
// allocate memory on GPU
(cudaMalloc((void **)&b1_memory, _memSizeInt * SIZE_UINT));
(cudaMalloc((void **)&b2_memory, _memSizeInt * SIZE_UINT));
(cudaMalloc((void **)&_memory_device, _memSizeInt * SIZE_UINT));
// copy values on GPU
(cudaMemcpy(b1_memory, b1._memory, _memSizeInt * SIZE_UINT, cudaMemcpyHostToDevice ));
(cudaMemcpy(b2_memory, b2._memory, _memSizeInt * SIZE_UINT, cudaMemcpyHostToDevice ));
(cudaMemcpy(_memory_device, _memory, _memSizeInt * SIZE_UINT, cudaMemcpyHostToDevice ));
dim3 dimBlock(1, 1);
dim3 dimGrid(1, 1);
AndBitwiseOperation<<<dimGrid, dimBlock>>>(_memory_device, b1_size, b1_memory, b2_memory);
// return values
(cudaMemcpy(_memory, _memory_device, _memSizeInt * SIZE_UINT, cudaMemcpyDeviceToHost ));
// Free Memory
(cudaFree(b1_memory));
(cudaFree(b2_memory));
(cudaFree(_memory_device));
}
b1 and b2 are bitmaps with 4 bits for each element. The number of elements depend from the number of customers. Also, I have problem with the kernel's parameters, because if I add some blocks or threads, the AndBitwiseOperation() is not giving me the correct result. With just 1 block and 1 thread per block the result is correct but the kernel is not in parallel.
Every advice is welcomed :)
Thank you
I did not really understood what your kernel wants to do.
You should read more about CUDA and GPU programming.
I tried to point out some of the mistakes.
Shared memory (sm) should reduce global memory reads.
Analyze your global memory (gm) read and write operations per thread.
a. You read global memory two times and write sm two times
b. (nonsense loop ignored, no use of index) you read two times sn and write once sm
c. you read once sm and write once gm
So in total you have nothing gained. You could directly use the global memory.
You use all threads to write out one value at the block index "i".
You should only use one thread to write this data out.
It makes no sense outputing the same data by multiple threads which will get serialized.
You use a loop and don't use the loop counter at all.
You write at "tid" and read at "i" randomly.
This assignement is overhead.
unsigned int tid = threadIdx.x;
The results cannot be correct with more then one block since with one block tid = i!
All the wrong indexing results in wrong calculation using more then one block
The shared memory at "i" was never written!
_memory_device[i] = shared_memory_data[i];
My assumption what your kernel should do
/*
* Call kernel with x-block usage and up to 3D Grid
*/
__global__ void bitwiseAnd(int* outData_g,
const long long int inSize_s,
const int* inData1_g,
const int* inData2_g)
{
//get unique block index
const unsigned long long int blockId = blockIdx.x //1D
+ blockIdx.y * gridDim.x //2D
+ gridDim.x * gridDim.y * blockIdx.z; //3D
//get unique thread index
const unsigned long long int threadId = blockId * blockDim.x + threadIdx.x;
//check global unique thread range
if(threadId >= inSize_s)
return;
//output bitwise and
outData_g[thread] = inData1_g[thread] & inData2_g[thread];
}
When you declare an extern __shared__ array, you must also specify its size in the kernel call.
The kernel configuration is:
<<< Dg, Db, Ns, S >>>
Ns is the size of the extern __shared__ arrays, and defaults to 0.
I don't think you can define more than one extern __shared__ array in your kernel. An example in the Programming Guide defines a single extern __shared__ array and manually sets arrays with offsets within it:
extern __shared__ float array[];
__device__ void func() // __device__ or __global__ function
{
short* array0 = (short*)array;
float* array1 = (float*)&array0[128];
int* array2 = (int*)&array1[64];
}

CUDA: Shared memory over a large-ish 2D array

I had a simple CUDA problem for a class assignment, but the professor added an optional task to implement the same algorithm using shared memory instead. I was unable to finish it before the deadline (as in, the turn-in date was a week ago) but I'm still curious so now I'm going to ask the internet ;).
The basic assignment was to implement a bastardized version of a red-black successive over-relaxation both sequentially and in CUDA, make sure you got the same result in both and then compare the speedup. Like I said, doing it with shared memory was an optional +10% add-on.
I'm going to post my working version and pseudocode what I've attempted to do since I don't have the code in my hands at the moment, but I can update this later with the actual code if someone needs it.
Before anyone says it: Yes, I know using CUtil is lame, but it made the comparison and timers easier.
Working global memory version:
#include <stdlib.h>
#include <stdio.h>
#include <cutil_inline.h>
#define N 1024
__global__ void kernel(int *d_A, int *d_B) {
unsigned int index_x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int index_y = blockIdx.y * blockDim.y + threadIdx.y;
// map the two 2D indices to a single linear, 1D index
unsigned int grid_width = gridDim.x * blockDim.x;
unsigned int index = index_y * grid_width + index_x;
// check for boundaries and write out the result
if((index_x > 0) && (index_y > 0) && (index_x < N-1) && (index_y < N-1))
d_B[index] = (d_A[index-1]+d_A[index+1]+d_A[index+N]+d_A[index-N])/4;
}
main (int argc, char **argv) {
int A[N][N], B[N][N];
int *d_A, *d_B; // These are the copies of A and B on the GPU
int *h_B; // This is a host copy of the output of B from the GPU
int i, j;
int num_bytes = N * N * sizeof(int);
// Input is randomly generated
for(i=0;i<N;i++) {
for(j=0;j<N;j++) {
A[i][j] = rand()/1795831;
//printf("%d\n",A[i][j]);
}
}
cudaEvent_t start_event0, stop_event0;
float elapsed_time0;
CUDA_SAFE_CALL( cudaEventCreate(&start_event0) );
CUDA_SAFE_CALL( cudaEventCreate(&stop_event0) );
cudaEventRecord(start_event0, 0);
// sequential implementation of main computation
for(i=1;i<N-1;i++) {
for(j=1;j<N-1;j++) {
B[i][j] = (A[i-1][j]+A[i+1][j]+A[i][j-1]+A[i][j+1])/4;
}
}
cudaEventRecord(stop_event0, 0);
cudaEventSynchronize(stop_event0);
CUDA_SAFE_CALL( cudaEventElapsedTime(&elapsed_time0,start_event0, stop_event0) );
h_B = (int *)malloc(num_bytes);
memset(h_B, 0, num_bytes);
//ALLOCATE MEMORY FOR GPU COPIES OF A AND B
cudaMalloc((void**)&d_A, num_bytes);
cudaMalloc((void**)&d_B, num_bytes);
cudaMemset(d_A, 0, num_bytes);
cudaMemset(d_B, 0, num_bytes);
//COPY A TO GPU
cudaMemcpy(d_A, A, num_bytes, cudaMemcpyHostToDevice);
// create CUDA event handles for timing purposes
cudaEvent_t start_event, stop_event;
float elapsed_time;
CUDA_SAFE_CALL( cudaEventCreate(&start_event) );
CUDA_SAFE_CALL( cudaEventCreate(&stop_event) );
cudaEventRecord(start_event, 0);
// TODO: CREATE BLOCKS AND THREADS AND INVOKE GPU KERNEL
dim3 block_size(256,1,1); //values experimentally determined to be fastest
dim3 grid_size;
grid_size.x = N / block_size.x;
grid_size.y = N / block_size.y;
kernel<<<grid_size,block_size>>>(d_A,d_B);
cudaEventRecord(stop_event, 0);
cudaEventSynchronize(stop_event);
CUDA_SAFE_CALL( cudaEventElapsedTime(&elapsed_time,start_event, stop_event) );
//COPY B BACK FROM GPU
cudaMemcpy(h_B, d_B, num_bytes, cudaMemcpyDeviceToHost);
// Verify result is correct
CUTBoolean res = cutComparei( (int *)B, (int *)h_B, N*N);
printf("Test %s\n",(1 == res)?"Passed":"Failed");
printf("Elapsed Time for Sequential: \t%.2f ms\n", elapsed_time0);
printf("Elapsed Time for CUDA:\t%.2f ms\n", elapsed_time);
printf("CUDA Speedup:\t%.2fx\n",(elapsed_time0/elapsed_time));
cudaFree(d_A);
cudaFree(d_B);
free(h_B);
cutilDeviceReset();
}
For the shared memory version, this is what I've tried so far:
#define N 1024
__global__ void kernel(int *d_A, int *d_B, int width) {
//assuming width is 64 because that's the biggest number I can make it
//each MP has 48KB of shared mem, which is 12K ints, 32 threads/warp, so max 375 ints/thread?
__shared__ int A_sh[3][66];
//get x and y index and turn it into linear index
for(i=0; i < width+2; i++) //have to load 2 extra values due to the -1 and +1 in algo
A_sh[index_y%3][i] = d_A[index+i-1]; //so A_sh[index_y%3][0] is actually d_A[index-1]
__syncthreads(); //and hope that previous and next row have been loaded by other threads in the block?
//ignore boundary conditions because it's pseudocode
for(i=0; i < width; i++)
d_B[index+i] = A_sh[index_y%3][i] + A_sh[index_y%3][i+2] + A_sh[index_y%3-1][i+1] + A_sh[index_y%3+1][i+1];
}
main(){
//same init as above until threads/grid init
dim3 threadsperblk(32,16);
dim3 numblks(32,64);
kernel<<<numblks,threadsperblk>>>(d_A,d_B,64);
//rest is the same
}
This shared mem code crashes ("launch failed due to unspecified error") since I haven't caught all the boundary conditions yet, but I'm not worried about that as much as finding the correct way to get things going. I feel that my code is way too complicated to be the correct path (especially compared to the SDK examples), but I also can't see another way to do it since my array doesn't fit into shared mem like all the examples I can find.
And frankly, I'm not sure it would be that much faster on my hardware (GTX 560 Ti - runs the global memory version in 0.121ms), but I need to prove it to myself first :P
Edit 2: For anyone who runs across this in the future, the code in the answer is a good starting point if you want to do some shared memory.
The key to getting the maximum out of these sort of stencil operators in CUDA is data re-usage. I have found that the best approach is usually to have each block "walk" through a dimension of the grid. After the block has loaded an initial tile of data into shared memory, only a single dimension (so row in a row-major order 2D problem ) needs to be read from global memory to have the necessary data in shared memory for the second and subsequent row calculations. The rest of the data can just be reused. To visualise how the shared memory buffer looks through the first four steps of this sort of algorithm:
Three "rows" (a,b,c) of the input grid are loaded to shared memory, and the stencil computed for row (b) and written to global memory
aaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbb
cccccccccccccccc
Another row (d) is loaded into the shared memory buffer, replacing row (a), and the calculations made for row (c) using a different stencil, reflecting where the row data is in shared memory
dddddddddddddddd
bbbbbbbbbbbbbbbb
cccccccccccccccc
Another row (e) is loaded into the shared memory buffer, replacing row (b), and the calculations made for row (d), using a different stencil from either step 1 or 2.
dddddddddddddddd
eeeeeeeeeeeeeeee
cccccccccccccccc
Another row (f) is loaded into the shared memory buffer, replacing row (c), and the calculations made for row (e). Now the data is back to the same layout as used in step 1, and the same stencil used in step 1 can be used.
dddddddddddddddd
eeeeeeeeeeeeeeee
ffffffffffffffff
The whole cycle repeats until the block has traverse full column length of the input grid. The reason for using different stencils rather than shifting the data in the shared memory buffer is down to performance - shared memory only has about 1000 Gb/s bandwidth on Fermi, and the shifting of data will become a bottleneck in fully optimal code. You should try different buffer sizes, because you might find smaller buffers allows for higher occupancy and improved kernel throughput.
EDIT: To give a concrete example of how that might be implemented:
template<int width>
__device__ void rowfetch(int *in, int *out, int col)
{
*out = *in;
if (col == 1) *(out-1) = *(in-1);
if (col == width) *(out+1) = *(in+1);
}
template<int width>
__global__ operator(int *in, int *out, int nrows, unsigned int lda)
{
// shared buffer holds three rows x (width+2) cols(threads)
__shared__ volatile int buffer [3][2+width];
int colid = threadIdx.x + blockIdx.x * blockDim.x;
int tid = threadIdx.x + 1;
int * rowpos = &in[colid], * outpos = &out[colid];
// load the first three rows (compiler will unroll loop)
for(int i=0; i<3; i++, rowpos+=lda) {
rowfetch<width>(rowpos, &buffer[i][tid], tid);
}
__syncthreads(); // shared memory loaded and all threads ready
int brow = 0; // brow is the next buffer row to load data onto
for(int i=0; i<nrows; i++, rowpos+=lda, outpos+=lda) {
// Do stencil calculations - use the value of brow to determine which
// stencil to use
result = ();
// write result to outpos
*outpos = result;
// Fetch another row
__syncthreads(); // Wait until all threads are done calculating
rowfetch<width>(rowpos, &buffer[brow][tid], tid);
brow = (brow < 2) ? (brow+1) : 0; // Increment or roll brow over
__syncthreads(); // Wait until all threads have updated the buffer
}
}