CUDA: Shared memory over a large-ish 2D array

I had a simple CUDA problem for a class assignment, but the professor added an optional task to implement the same algorithm using shared memory instead. I was unable to finish it before the deadline (as in, the turn-in date was a week ago), but I'm still curious, so now I'm going to ask the internet ;).
The basic assignment was to implement a bastardized version of red-black successive over-relaxation both sequentially and in CUDA, make sure you got the same result in both, and then compare the speedup. Like I said, doing it with shared memory was an optional +10% add-on.
I'm going to post my working version and pseudocode of what I've attempted, since I don't have the code in hand at the moment, but I can update this later with the actual code if someone needs it.
Before anyone says it: Yes, I know using CUtil is lame, but it made the comparison and timers easier.
Working global memory version:
#include <stdlib.h>
#include <stdio.h>
#include <cutil_inline.h>
#define N 1024
__global__ void kernel(int *d_A, int *d_B) {
unsigned int index_x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int index_y = blockIdx.y * blockDim.y + threadIdx.y;
// map the two 2D indices to a single linear, 1D index
unsigned int grid_width = gridDim.x * blockDim.x;
unsigned int index = index_y * grid_width + index_x;
// check for boundaries and write out the result
if((index_x > 0) && (index_y > 0) && (index_x < N-1) && (index_y < N-1))
d_B[index] = (d_A[index-1]+d_A[index+1]+d_A[index+N]+d_A[index-N])/4;
}
int main(int argc, char **argv) {
int A[N][N], B[N][N];
int *d_A, *d_B; // These are the copies of A and B on the GPU
int *h_B; // This is a host copy of the output of B from the GPU
int i, j;
int num_bytes = N * N * sizeof(int);
// Input is randomly generated
for(i=0;i<N;i++) {
for(j=0;j<N;j++) {
A[i][j] = rand()/1795831;
//printf("%d\n",A[i][j]);
}
}
cudaEvent_t start_event0, stop_event0;
float elapsed_time0;
CUDA_SAFE_CALL( cudaEventCreate(&start_event0) );
CUDA_SAFE_CALL( cudaEventCreate(&stop_event0) );
cudaEventRecord(start_event0, 0);
// sequential implementation of main computation
for(i=1;i<N-1;i++) {
for(j=1;j<N-1;j++) {
B[i][j] = (A[i-1][j]+A[i+1][j]+A[i][j-1]+A[i][j+1])/4;
}
}
cudaEventRecord(stop_event0, 0);
cudaEventSynchronize(stop_event0);
CUDA_SAFE_CALL( cudaEventElapsedTime(&elapsed_time0,start_event0, stop_event0) );
h_B = (int *)malloc(num_bytes);
memset(h_B, 0, num_bytes);
//ALLOCATE MEMORY FOR GPU COPIES OF A AND B
cudaMalloc((void**)&d_A, num_bytes);
cudaMalloc((void**)&d_B, num_bytes);
cudaMemset(d_A, 0, num_bytes);
cudaMemset(d_B, 0, num_bytes);
//COPY A TO GPU
cudaMemcpy(d_A, A, num_bytes, cudaMemcpyHostToDevice);
// create CUDA event handles for timing purposes
cudaEvent_t start_event, stop_event;
float elapsed_time;
CUDA_SAFE_CALL( cudaEventCreate(&start_event) );
CUDA_SAFE_CALL( cudaEventCreate(&stop_event) );
cudaEventRecord(start_event, 0);
// TODO: CREATE BLOCKS AND THREADS AND INVOKE GPU KERNEL
dim3 block_size(256,1,1); //values experimentally determined to be fastest
dim3 grid_size;
grid_size.x = N / block_size.x;
grid_size.y = N / block_size.y;
kernel<<<grid_size,block_size>>>(d_A,d_B);
cudaEventRecord(stop_event, 0);
cudaEventSynchronize(stop_event);
CUDA_SAFE_CALL( cudaEventElapsedTime(&elapsed_time,start_event, stop_event) );
//COPY B BACK FROM GPU
cudaMemcpy(h_B, d_B, num_bytes, cudaMemcpyDeviceToHost);
// Verify result is correct
CUTBoolean res = cutComparei( (int *)B, (int *)h_B, N*N);
printf("Test %s\n",(1 == res)?"Passed":"Failed");
printf("Elapsed Time for Sequential: \t%.2f ms\n", elapsed_time0);
printf("Elapsed Time for CUDA:\t%.2f ms\n", elapsed_time);
printf("CUDA Speedup:\t%.2fx\n",(elapsed_time0/elapsed_time));
cudaFree(d_A);
cudaFree(d_B);
free(h_B);
cutilDeviceReset();
}
For the shared memory version, this is what I've tried so far:
#define N 1024
__global__ void kernel(int *d_A, int *d_B, int width) {
//assuming width is 64 because that's the biggest number I can make it
//each MP has 48KB of shared mem, which is 12K ints, 32 threads/warp, so max 375 ints/thread?
__shared__ int A_sh[3][66];
//get x and y index and turn it into linear index
unsigned int index_x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int index_y = blockIdx.y * blockDim.y + threadIdx.y;
unsigned int index = index_y * (gridDim.x * blockDim.x) + index_x;
int i;
for(i=0; i < width+2; i++) //have to load 2 extra values due to the -1 and +1 in algo
A_sh[index_y%3][i] = d_A[index+i-1]; //so A_sh[index_y%3][0] is actually d_A[index-1]
__syncthreads(); //and hope that previous and next row have been loaded by other threads in the block?
//ignore boundary conditions because it's pseudocode
for(i=0; i < width; i++)
d_B[index+i] = A_sh[index_y%3][i] + A_sh[index_y%3][i+2] + A_sh[index_y%3-1][i+1] + A_sh[index_y%3+1][i+1];
}
main(){
//same init as above until threads/grid init
dim3 threadsperblk(32,16);
dim3 numblks(32,64);
kernel<<<numblks,threadsperblk>>>(d_A,d_B,64);
//rest is the same
}
This shared mem code crashes ("launch failed due to unspecified error") since I haven't caught all the boundary conditions yet, but I'm not worried about that as much as finding the correct way to get things going. I feel that my code is way too complicated to be the correct path (especially compared to the SDK examples), but I also can't see another way to do it since my array doesn't fit into shared mem like all the examples I can find.
And frankly, I'm not sure it would be that much faster on my hardware (GTX 560 Ti - runs the global memory version in 0.121ms), but I need to prove it to myself first :P
Edit 2: For anyone who runs across this in the future, the code in the answer is a good starting point if you want to do some shared memory.

The key to getting the most out of this sort of stencil operator in CUDA is data re-use. I have found that the best approach is usually to have each block "walk" through a dimension of the grid. After the block has loaded an initial tile of data into shared memory, only a single dimension (so a row, in a row-major-order 2D problem) needs to be read from global memory to have the necessary data in shared memory for the second and subsequent row calculations. The rest of the data can just be reused. To visualise how the shared memory buffer looks through the first four steps of this sort of algorithm:
Three "rows" (a,b,c) of the input grid are loaded to shared memory, and the stencil computed for row (b) and written to global memory
aaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbb
cccccccccccccccc
Another row (d) is loaded into the shared memory buffer, replacing row (a), and the calculations made for row (c) using a different stencil, reflecting where the row data is in shared memory
dddddddddddddddd
bbbbbbbbbbbbbbbb
cccccccccccccccc
Another row (e) is loaded into the shared memory buffer, replacing row (b), and the calculations made for row (d), using a different stencil from either step 1 or 2.
dddddddddddddddd
eeeeeeeeeeeeeeee
cccccccccccccccc
Another row (f) is loaded into the shared memory buffer, replacing row (c), and the calculations made for row (e). Now the data is back to the same layout as used in step 1, and the same stencil used in step 1 can be used.
dddddddddddddddd
eeeeeeeeeeeeeeee
ffffffffffffffff
The whole cycle repeats until the block has traversed the full column length of the input grid. The reason for using different stencils rather than shifting the data in the shared memory buffer comes down to performance: shared memory only has about 1000 GB/s of bandwidth on Fermi, and shifting the data around would become a bottleneck in fully optimised code. You should try different buffer sizes, because you might find that smaller buffers allow for higher occupancy and improved kernel throughput.
EDIT: To give a concrete example of how that might be implemented:
template<int width>
__device__ void rowfetch(int *in, int *out, int col)
{
*out = *in;
if (col == 1) *(out-1) = *(in-1);
if (col == width) *(out+1) = *(in+1);
}
template<int width>
__global__ void stencilkernel(int *in, int *out, int nrows, unsigned int lda)
{
// shared buffer holds three rows x (width+2) cols(threads)
__shared__ volatile int buffer [3][2+width];
int colid = threadIdx.x + blockIdx.x * blockDim.x;
int tid = threadIdx.x + 1;
int * rowpos = &in[colid], * outpos = &out[colid];
// load the first three rows (compiler will unroll loop)
for(int i=0; i<3; i++, rowpos+=lda) {
rowfetch<width>(rowpos, &buffer[i][tid], tid);
}
__syncthreads(); // shared memory loaded and all threads ready
int brow = 0; // brow is the next buffer row to load data onto
for(int i=0; i<nrows; i++, rowpos+=lda, outpos+=lda) {
// Do stencil calculations - use the value of brow to determine which
// stencil to use
int result = 0; // stencil computed here from the three buffer rows, selected by brow
// write result to outpos
*outpos = result;
// Fetch another row
__syncthreads(); // Wait until all threads are done calculating
rowfetch<width>(rowpos, &buffer[brow][tid], tid);
brow = (brow < 2) ? (brow+1) : 0; // Increment or roll brow over
__syncthreads(); // Wait until all threads have updated the buffer
}
}
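To make the launch shape concrete, something along these lines would drive the kernel above; the 64-wide strips and the decision to gloss over the boundary rows are my assumptions, not part of the original answer:
// Hypothetical launch: one block per 64-column strip of the N x N grid from
// the question, each block walking down the column dimension. Offsets for the
// boundary rows/columns are glossed over, just as in the kernel sketch itself.
const int strip = 64;                 // threads per block == template parameter
dim3 block(strip);
dim3 grid(N / strip);                 // one block per column strip
stencilkernel<strip><<<grid, block>>>(d_A, d_B, N - 2, N);   // lda = N elements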

Related

CUDA: sum-reduction --- data lost in call to device function [duplicate]

I'm aware that there are multiple questions similar to this one already answered, but I've been unable to piece together anything very helpful from them, other than that I'm probably indexing something incorrectly.
I'm trying to perform a sequential-addressing reduction on input vector A into output vector B.
The full code is available here http://pastebin.com/7UGadgjX, but this is the kernel:
__global__ void vectorSum(int *A, int *B, int numElements) {
extern __shared__ int S[];
// Each thread loads one element from global to shared memory
int tid = threadIdx.x;
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < numElements) {
S[tid] = A[i];
__syncthreads();
// Reduce in shared memory
for (int t = blockDim.x/2; t > 0; t>>=1) {
if (tid < t) {
S[tid] += S[tid + t];
}
__syncthreads();
}
if (tid == 0) B[blockIdx.x] = S[0];
}
}
and these are the kernel launch statements:
// Launch the Vector Summation CUDA Kernel
int threadsPerBlock = 256;
int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
vectorSum<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, numElements);
I'm getting an unspecified launch error, which I've read is similar to a segfault. I've been following the NVIDIA reduction documentation closely and have tried to keep my kernel within the bounds of numElements, but I seem to be missing something key, considering how simple the code is.
Your problem is that the reduction kernel requires dynamically allocated shared memory to operate correctly, but your kernel launch doesn't specify any. The result is out of bounds/illegal shared memory access which aborts the kernel.
In CUDA runtime API syntax, the kernel launch statement has four arguments. The first two are the grid and block dimensions for the launch. The latter two are optional with zero default values, but specify the dynamically allocated shared memory size and stream.
To fix this, change the launch code as follows:
// Launch the Vector Summation CUDA Kernel
int threadsPerBlock = 256;
int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
size_t shmsz = (size_t)threadsPerBlock * sizeof(int);
vectorSum<<<blocksPerGrid, threadsPerBlock, shmsz>>>(d_A, d_B, numElements);
[disclaimer: code written in browser, not compiled or tested, use at own risk]
This should at least fix the most obvious problem with your code.

Finding min and max by the cuda kernel reduction

Here is my kernel code
typedef unsigned char Npp8u;
...
// Kernel Implementation
__device__ unsigned int min_device;
__device__ unsigned int max_device;
__global__ void findMax_Min(Npp8u * data, int numEl){
int index = blockDim.x*blockIdx.x + threadIdx.x;
int shared_index = threadIdx.x;
__shared__ Npp8u data_shared_min[BLOCKDIM];
__shared__ Npp8u data_shared_max[BLOCKDIM];
// check index condition
if(index < numEl){
data_shared_min[shared_index] = data[index]; //pass values from global to shared memory
__syncthreads();
data_shared_max[shared_index] = data[index]; //pass values from global to shared memory
for (unsigned int stride = BLOCKDIM/2; stride > 0; stride >>= 1) {
if(threadIdx.x < stride){
if(data_shared_max[threadIdx.x] < data_shared_max[threadIdx.x+stride]) data_shared_max[shared_index] = data_shared_max[shared_index+stride];
if(data_shared_min[threadIdx.x]> data_shared_min[threadIdx.x+stride]) data_shared_min[shared_index] = data_shared_min[shared_index+stride];
}
__syncthreads();
}
if(threadIdx.x == 0 ){
atomicMin(&(min_device), (unsigned int)data_shared_min[threadIdx.x ]);
//min_device =10;
__syncthreads();
atomicMax(&(max_device), (unsigned int)data_shared_max[threadIdx.x ]);
}
}else{
data_shared_min[shared_index] = 9999;
}
}
I have an image that is 512x512 and I want to find the min and max pixel values. data is the 1-D version of the image. This code works for the max but not for the min value. As I checked in MATLAB, the max value is 202 and the min value is 10, but my code finds 0 for the min. Here is my host code with the kernel launch and memcpy calls:
int main(){
// Host parameter declarations.
Npp8u * imageHost;
int nWidth, nHeight, nMaxGray;
// Load image to the host.
std::cout << "Load PGM file." << std::endl;
imageHost = LoadPGM("lena_before.pgm", nWidth, nHeight, nMaxGray);
// Device parameter declarations.
Npp8u * imageDevice;
unsigned int max, min;
size_t size = sizeof(Npp8u)*nWidth*nHeight;
cudaMalloc((Npp8u**)&imageDevice, size);
cudaMemcpy(imageDevice, imageHost, size, cudaMemcpyHostToDevice);
int numPixels = nWidth*nHeight;
dim3 numThreads(BLOCKDIM);
dim3 numBlocks(numPixels/BLOCKDIM + (numPixels%BLOCKDIM == 0 ? 0 : 1));
findMax_Min<<<numBlocks, numThreads>>>(imageDevice,numPixels);
cudaMemcpyFromSymbol(&max,max_device, sizeof(max_device), 0, cudaMemcpyDeviceToHost);
cudaMemcpyFromSymbol(&min,min_device, sizeof(min_device), 0, cudaMemcpyDeviceToHost);
printf("Min value for image : %i\n", min);
printf("Max value for image : %i\n", max);
...
Another interesting thing: changing the order of the cudaMemcpy calls just after the kernel call also causes malfunctioning, and then both values are read as zero. I do not see the problem. Does anyone see what I am missing?
You might want to do CUDA error checking. You might also want to initialize min_device to a large value and max_device to zero. There are other problems with your reduction method related to stride (in the last block of an odd-sized image, adding stride to threadIdx.x may take you past the range of valid image data in shared memory), but I don't think it matters for a 512x512 image. If min_device just happened to start out at zero, all of your atomicMin operations would always leave zero there.
You can try initializing min_device and max_device like this:
__device__ unsigned int min_device = 9999;
__device__ unsigned int max_device = 0;
For the cudaMemcpy calls at the end, you are copying 4 bytes (the size of max_device) into a one-byte variable (Npp8u max), and likewise for min. So that's a problem. Since you're using pointers, the copy operation is definitely overwriting something that you don't intend. If the compiler stores the variables sequentially in the order you have them defined, one copy operation is overwriting the other variable, which I think would explain the behavior you're seeing. If you created min and max as unsigned int quantities, I think this problem would go away.
EDIT: Since you haven't shown your actual block dimensions, it's possible that you still have a problem with your reduction. You might want to change this line:
if(threadIdx.x < stride){
To something like:
if((threadIdx.x < stride) && ((index + stride)< numEl)){
This or something like it should correct the hazard I mention in the first paragraph. I guess you're trying to account for the hazard using this line:
data_shared_min[shared_index] = 9999;
But there's no guarantee that line of code gets executed before the data element that it is setting in shared memory gets read by some other thread. I also don't know what happens when you assign a value of 9999 to a byte quantity, but it's probably not what you expect.
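To make that hazard concrete, here is one way the padded version could look. It is a sketch rather than the asker's code (the kernel name is hypothetical); it assumes BLOCKDIM threads per block and the min_device/max_device initialization shown above, and it fills out-of-range lanes with identity values before the first __syncthreads() instead of relying on the guarded index condition:
__global__ void findMaxMinPadded(const Npp8u *data, int numEl) {
int index = blockDim.x * blockIdx.x + threadIdx.x;
int tid = threadIdx.x;
__shared__ Npp8u data_shared_min[BLOCKDIM];
__shared__ Npp8u data_shared_max[BLOCKDIM];
// Every lane writes its slot before the first __syncthreads():
// 255 is the identity for min over Npp8u, 0 is the identity for max.
data_shared_min[tid] = (index < numEl) ? data[index] : (Npp8u)255;
data_shared_max[tid] = (index < numEl) ? data[index] : (Npp8u)0;
__syncthreads();
for (unsigned int stride = BLOCKDIM/2; stride > 0; stride >>= 1) {
if (tid < stride) {
if (data_shared_max[tid] < data_shared_max[tid + stride])
data_shared_max[tid] = data_shared_max[tid + stride];
if (data_shared_min[tid] > data_shared_min[tid + stride])
data_shared_min[tid] = data_shared_min[tid + stride];
}
__syncthreads();
}
if (tid == 0) {
atomicMin(&min_device, (unsigned int)data_shared_min[0]);
atomicMax(&max_device, (unsigned int)data_shared_max[0]);
}
}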

CUDA: bad performance with shared memory and no parallelism

I'm trying to exploit shared memory in this kernel function, but the performance are not as good as I was expecting. This function is called many times in my application (about 1000 times or more), so I was thinking to exploit shared memory to avoid the memory latency. But something is wrong apparently because my application became really slow since i'm using shared memory.
This is the kernel:
__global__ void AndBitwiseOperation(int* _memory_device, int b1_size, int* b1_memory, int* b2_memory){
int j = 0;
// index GPU - Transaction-wise
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int tid = threadIdx.x;
// shared variable
extern __shared__ int shared_memory_data[];
extern __shared__ int shared_b1_data[];
extern __shared__ int shared_b2_data[];
// copy from global memory into shared memory and sync threads
shared_b1_data[tid] = b1_memory[tid];
shared_b2_data[tid] = b2_memory[tid];
__syncthreads();
// AND each int bitwise
for(j = 0; j < b1_size; j++)
shared_memory_data[tid] = (shared_b1_data[tid] & shared_b2_data[tid]);
// write result for this block to global memory
_memory_device[i] = shared_memory_data[i];
}
The shared variables are declared extern because I don't know the size of b1 and b2, since they depend on the number of customers, which I only know at runtime (but both always have the same size).
This is how I call the kernel:
void Bitmap::And(const Bitmap &b1, const Bitmap &b2)
{
int* _memory_device;
int* b1_memory;
int* b2_memory;
int b1_size = b1.getIntSize();
// allocate memory on GPU
(cudaMalloc((void **)&b1_memory, _memSizeInt * SIZE_UINT));
(cudaMalloc((void **)&b2_memory, _memSizeInt * SIZE_UINT));
(cudaMalloc((void **)&_memory_device, _memSizeInt * SIZE_UINT));
// copy values on GPU
(cudaMemcpy(b1_memory, b1._memory, _memSizeInt * SIZE_UINT, cudaMemcpyHostToDevice ));
(cudaMemcpy(b2_memory, b2._memory, _memSizeInt * SIZE_UINT, cudaMemcpyHostToDevice ));
(cudaMemcpy(_memory_device, _memory, _memSizeInt * SIZE_UINT, cudaMemcpyHostToDevice ));
dim3 dimBlock(1, 1);
dim3 dimGrid(1, 1);
AndBitwiseOperation<<<dimGrid, dimBlock>>>(_memory_device, b1_size, b1_memory, b2_memory);
// return values
(cudaMemcpy(_memory, _memory_device, _memSizeInt * SIZE_UINT, cudaMemcpyDeviceToHost ));
// Free Memory
(cudaFree(b1_memory));
(cudaFree(b2_memory));
(cudaFree(_memory_device));
}
b1 and b2 are bitmaps with 4 bits for each element. The number of elements depends on the number of customers. Also, I have a problem with the kernel's parameters, because if I add more blocks or threads, AndBitwiseOperation() does not give me the correct result. With just 1 block and 1 thread per block the result is correct, but then the kernel is not parallel at all.
Any advice is welcome :)
Thank you
I did not really understand what your kernel is meant to do.
You should read more about CUDA and GPU programming.
I have tried to point out some of the mistakes:
Shared memory (sm) should reduce global memory reads.
Analyze your global memory (gm) read and write operations per thread.
a. You read global memory twice and write shared memory twice.
b. (ignoring the pointless loop, which never uses its counter) you read shared memory twice and write shared memory once.
c. You read shared memory once and write global memory once.
So in total you have gained nothing. You could use global memory directly.
You use all threads to write out one value at the block index "i".
You should only use one thread to write this data out.
It makes no sense to output the same data from multiple threads; the writes just get serialized.
You use a loop and never use the loop counter.
You write at "tid" and read at "i" randomly.
This assignment is overhead:
unsigned int tid = threadIdx.x;
The results cannot be correct with more than one block, since only with one block is tid == i.
All the wrong indexing results in wrong calculations when using more than one block.
The shared memory at "i" was never written!
_memory_device[i] = shared_memory_data[i];
My assumption what your kernel should do
/*
* Call kernel with x-block usage and up to 3D Grid
*/
__global__ void bitwiseAnd(int* outData_g,
const long long int inSize_s,
const int* inData1_g,
const int* inData2_g)
{
//get unique block index
const unsigned long long int blockId = blockIdx.x //1D
+ blockIdx.y * gridDim.x //2D
+ gridDim.x * gridDim.y * blockIdx.z; //3D
//get unique thread index
const unsigned long long int threadId = blockId * blockDim.x + threadIdx.x;
//check global unique thread range
if(threadId >= inSize_s)
return;
//output bitwise and
outData_g[threadId] = inData1_g[threadId] & inData2_g[threadId];
}
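If a kernel along those lines replaced the original one, the <<<1, 1>>> launch could be replaced by something like the following sketch (names taken from the question; _memSizeInt is assumed to be the number of int elements, and no dynamic shared memory is needed):
// Hypothetical replacement for the original <<<1,1>>> launch.
const int threadsPerBlock = 256;
const int numBlocks = (_memSizeInt + threadsPerBlock - 1) / threadsPerBlock;
bitwiseAnd<<<numBlocks, threadsPerBlock>>>(_memory_device, _memSizeInt, b1_memory, b2_memory);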
When you declare an extern __shared__ array, you must also specify its size in the kernel call.
The kernel configuration is:
<<< Dg, Db, Ns, S >>>
Ns is the size of the extern __shared__ arrays, and defaults to 0.
I don't think you can define more than one extern __shared__ array in your kernel. An example in the Programming Guide defines a single extern __shared__ array and manually sets arrays with offsets within it:
extern __shared__ float array[];
__device__ void func() // __device__ or __global__ function
{
short* array0 = (short*)array;
float* array1 = (float*)&array0[128];
int* array2 = (int*)&array1[64];
}
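Applied to the asker's kernel, a minimal sketch of that single-array approach might look like this; the kernel name, the size parameter n, and dropping the unused loop are my assumptions:
__global__ void andShared(int *out, const int *in1, const int *in2, unsigned int n)
{
extern __shared__ int shbuf[];               // sized via the launch's third argument
int *shared_b1_data = shbuf;                 // first blockDim.x ints
int *shared_b2_data = shbuf + blockDim.x;    // next blockDim.x ints
unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int tid = threadIdx.x;
if (i < n) {
shared_b1_data[tid] = in1[i];
shared_b2_data[tid] = in2[i];
}
__syncthreads();
if (i < n)
out[i] = shared_b1_data[tid] & shared_b2_data[tid];
}
// Host side: reserve room for both per-block buffers via the third launch argument.
// size_t shmsz = 2 * threadsPerBlock * sizeof(int);
// andShared<<<numBlocks, threadsPerBlock, shmsz>>>(d_out, d_in1, d_in2, n);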

CUDA-Kernel supposed to be dynamic crashes depending upon block size

I want to do a sparse matrix, dense vector multiplication. Let's assume the only storage format for compressing the entries in the matrix is compressed row storage (CRS).
My kernel looks like the following:
__global__ void
krnlSpMVmul1(
float *data_mat,
int num_nonzeroes,
unsigned int *row_ptr,
float *data_vec,
float *data_result)
{
extern __shared__ float local_result[];
local_result[threadIdx.x] = 0;
float vector_elem = data_vec[blockIdx.x];
unsigned int start_index = row_ptr[blockIdx.x];
unsigned int end_index = row_ptr[blockIdx.x + 1];
for (int index = (start_index + threadIdx.x); (index < end_index) && (index < num_nonzeroes); index += blockDim.x)
local_result[threadIdx.x] += (data_mat[index] * vector_elem);
__syncthreads();
// Reduction
// Writing accumulated sum into result vector
}
As you can see, the kernel is supposed to be as naive as possible, and it even does a few things wrong (e.g. vector_elem is not always the correct value). I am aware of those things.
Now to my problem:
Suppose I am using a blocksize of 32 or 64 threads. As soon as a row in my matrix has more than 16 nonzeroes (e.g. 17), only the first 16 multiplications are done and saved to shared memory. I know that the value at local_result[16], which should hold the result of the 17th multiplication, is just zero. Using a blocksize of 16 or 128 threads fixes the explained problem.
Since I am fairly new to CUDA I might have overlooked the simplest thing but I cannot make up any more situations to look at.
Help is very much appreciated!
Edit towards talonmies comment:
I printed the value in local_result[16] directly after the computation. It was 0. Nevertheless, here is the missing code:
The reduction part:
int k = blockDim.x / 2;
while (k != 0)
{
if (threadIdx.x < k)
local_result[threadIdx.x] += local_result[threadIdx.x + k];
else
return;
__syncthreads();
k /= 2;
}
and how I write the results back to global memory:
data_result[blockIdx.x] = local_result[0];
That's all I've got.
Right now I am testing a scenario with a matrix consisting of a single row of 17 elements, all of which are non-zero. The buffers look like this in pseudocode:
float data_mat[17] = { val0, .., val16 }
unsigned int row_ptr[2] = { 0, 17 }
float data_vec[17] = { val0 } // all values are the same
float data_result[1] = { 0 }
And that's an excerpt of my wrapper function:
float *dev_data_mat;
unsigned int *dev_row_ptr;
float *dev_data_vec;
float *dev_data_result;
// Allocate memory on the device
HANDLE_ERROR(cudaMalloc((void**) &dev_data_mat, num_nonzeroes * sizeof(float)));
HANDLE_ERROR(cudaMalloc((void**) &dev_row_ptr, num_row_ptr * sizeof(unsigned int)));
HANDLE_ERROR(cudaMalloc((void**) &dev_data_vec, dim_x * sizeof(float)));
HANDLE_ERROR(cudaMalloc((void**) &dev_data_result, dim_y * sizeof(float)));
// Copy each buffer into the allocated memory
HANDLE_ERROR(cudaMemcpy(
dev_data_mat,
data_mat,
num_nonzeroes * sizeof(float),
cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(
dev_row_ptr,
row_ptr,
num_row_ptr * sizeof(unsigned int),
cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(
dev_data_vec,
data_vec,
dim_x * sizeof(float),
cudaMemcpyHostToDevice));
HANDLE_ERROR(cudaMemcpy(
dev_data_result,
data_result,
dim_y * sizeof(float),
cudaMemcpyHostToDevice));
// Calc grid dimension and block dimension
dim3 grid_dim(dim_y);
dim3 block_dim(BLOCK_SIZE);
// Start kernel
krnlSpMVmul1<<<grid_dim, block_dim, BLOCK_SIZE>>>(
dev_data_mat,
num_nonzeroes,
dev_row_ptr,
dev_data_vec,
dev_data_result);
I hope this is straightforward, but I will explain things further if anything is of interest.
One more thing: I just realized that using a BLOCK_SIZE of 128 and having 33 nonzeroes makes the kernel fail as well. Again, just the last value is not being computed.
Your dynamically allocated shared memory size is incorrect. Right now you are doing this:
krnlSpMVmul1<<<grid_dim, block_dim, BLOCK_SIZE>>>(.....)
The shared memory size should be given in bytes. In your 64 threads per block case, that means you would be allocating enough shared memory for only 16 float-sized words, which explains why the magic 17-entries-per-row case results in failure: you have a shared memory buffer overflow, which will trigger a protection fault in the GPU and abort the kernel.
You should be doing something like this:
krnlSpMVmul1<<<grid_dim, block_dim, BLOCK_SIZE * sizeof(float)>>>(.....)
That will give you the correct dynamic shared memory size and should eliminate the problem.
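To spell out the arithmetic: the third launch argument is a byte count, so with a BLOCK_SIZE of 64 the original launch reserves only 64 bytes, i.e. 64 / sizeof(float) = 16 floats, whereas each of the 64 threads needs its own float. Written out with the arguments from the wrapper above:
// Corrected launch: BLOCK_SIZE threads each get one float of dynamic shared
// memory, e.g. 64 * 4 = 256 bytes instead of 64 bytes.
krnlSpMVmul1<<<grid_dim, block_dim, BLOCK_SIZE * sizeof(float)>>>(
dev_data_mat,
num_nonzeroes,
dev_row_ptr,
dev_data_vec,
dev_data_result);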

cuda shared memory overwrite?

I am trying to write a parallel prefix scan in CUDA by following this tutorial.
I am trying the work-inefficient 'double buffered' version as explained in the tutorial.
This is what I have:
// double buffered naive.
// d = number of iterations, N - size, and input.
__global__ void prefixsum(int* in, int d, int N)
{
//get the block index
int idx = blockIdx.x*blockDim.x + threadIdx.x;
// allocate shared memory
extern __shared__ int temp_in[], temp_out[];
// copy data to it.
temp_in[idx] = in[idx];
temp_out[idx] = 0;
// block until all threads copy
__syncthreads();
int i = 1;
for (i; i<=d; i++)
{
if (idx < N+1 && idx >= (int)pow(2.0f,(float)i-1))
{
// copy new result to temp_out
temp_out[idx] += temp_in[idx - (int)pow(2.0f,(float)i-1)] + temp_in[idx];
}
else
{
// if the element is to remain unchanged, copy the same thing
temp_out[idx] = temp_in[idx];
}
// block until all theads do this
__syncthreads();
// copy the result to temp_in for next iteration
temp_in[idx] = temp_out[idx];
// wait for all threads to do so
__syncthreads();
}
//finally copy everything back to global memory
in[idx] = temp_in[idx];
}
Can you point out what's wrong with this? I have written comments for what I think should happen.
This is the kernel invocation -
prefixsum<<<dimGrid,dimBlock>>>(d_arr, log(SIZE)/log(2), N);
This is the grid and block allocations:
dim3 dimGrid(numBlocks);
dim3 dimBlock(numThreadsPerBlock);
The problem is that I don't get the correct output for any input that's more than 8 elements long.
I see two problems in your code
Problem 1: extern shared memory
Agh.... I hate extern __shared__ memory. The problem is that the compiler does not know how big the arrays are. As a result, they both point to the same piece of memory!
So, in your case: temp_in[5] and temp_out[5] refer to the same word in shared memory.
If you really want the extern __shared__ memory, you can manually offset the second array, for example something like this:
size_t size = .... //the size of your array
extern __shared__ int memory[];
int* temp_in=memory;
int* temp_out=memory+size;
Problem 2: Shared array index
Shared memory is private for each block. That is, temp[0] in one block can be different than temp[0] in another block. However, you index it by blockIdx.x*blockDim.x + threadIdx.x as if the temp arrays were shared between the blocks.
Instead, you should most likely index your temp arrays just by threadIdx.x.
Of course, the idx array is global and you index that one correctly.
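Putting both fixes together, here is a minimal sketch of how the kernel's shared memory handling could look. The scan iterations themselves are left out; the points being illustrated are splitting one extern __shared__ allocation and indexing it with threadIdx.x:
__global__ void prefixsum(int* in, int d, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global index, for in[]
int tid = threadIdx.x;                            // block-local index, for shared memory
extern __shared__ int shmem[];                    // one allocation, split manually
int* temp_in  = shmem;
int* temp_out = shmem + blockDim.x;
temp_in[tid]  = (idx < N) ? in[idx] : 0;
temp_out[tid] = 0;
__syncthreads();
// ... the d scan iterations go here, indexing temp_in/temp_out with tid ...
if (idx < N)
in[idx] = temp_in[tid];
}
// Launched with room for both halves:
// prefixsum<<<dimGrid, dimBlock, 2 * numThreadsPerBlock * sizeof(int)>>>(d_arr, d, N);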