CUDA Dot Product

I'm trying to implement the classic dot-product kernel for double-precision arrays, with atomic computation of the final sum across the blocks. I used atomicAdd for double precision as described on page 116 of the programming guide. I'm probably doing something wrong. The partial sums across the threads in every block are computed correctly, but afterwards the atomic operation doesn't seem to be working properly, since every time I run my kernel with the same data I get different results. I'd be grateful if somebody could spot the mistake or provide an alternative solution!
Here is my kernel:
__global__ void cuda_dot_kernel(int *n, double *a, double *b, double *dot_res)
{
    __shared__ double cache[threadsPerBlock]; //thread shared memory
    int global_tid = threadIdx.x + blockIdx.x * blockDim.x;
    int i = 0, cacheIndex = 0;
    double temp = 0;
    cacheIndex = threadIdx.x;
    while (global_tid < (*n)) {
        temp += a[global_tid] * b[global_tid];
        global_tid += blockDim.x * gridDim.x;
    }
    cache[cacheIndex] = temp;
    __syncthreads();
    for (i = blockDim.x / 2; i > 0; i >>= 1) {
        if (threadIdx.x < i) {
            cache[threadIdx.x] += cache[threadIdx.x + i];
        }
        __syncthreads();
    }
    __syncthreads();
    if (cacheIndex == 0) {
        *dot_res = cuda_atomicAdd(dot_res, cache[0]);
    }
}
And here is my device function atomicAdd:
__device__ double cuda_atomicAdd(double *address, double val)
{
    double assumed, old = *address;
    do {
        assumed = old;
        old = __longlong_as_double(atomicCAS((unsigned long long int*)address,
                                             __double_as_longlong(assumed),
                                             __double_as_longlong(val + assumed)));
    } while (assumed != old);
    return old;
}

Getting a reduction right using ad hoc CUDA code can be tricky, so here's an alternative solution using a Thrust algorithm, which is included with the CUDA Toolkit:
#include <thrust/inner_product.h>
#include <thrust/device_ptr.h>
double do_dot_product(int n, double *a, double *b)
{
    // wrap raw pointers to device memory with device_ptr
    thrust::device_ptr<double> d_a(a), d_b(b);
    // inner_product implements a mathematical dot product
    return thrust::inner_product(d_a, d_a + n, d_b, 0.0);
}
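Calling it from host code might look like this (just a sketch; d_a, d_b and n are placeholder names, and the inputs are assumed to already live on the device):
// hypothetical usage of do_dot_product with raw device pointers
int n = 1 << 20;
double *d_a, *d_b;
cudaMalloc((void**)&d_a, n * sizeof(double));
cudaMalloc((void**)&d_b, n * sizeof(double));
// ... fill d_a and d_b, e.g. with cudaMemcpy from host arrays ...
double result = do_dot_product(n, d_a, d_b);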

You are using the cuda_atomicAdd function incorrectly. This section of your kernel:
if (cacheIndex == 0) {
    *dot_res = cuda_atomicAdd(dot_res, cache[0]);
}
is the culprit. Here, you atomically add to dot_res, then non-atomically overwrite dot_res with the value the function returns. The return value of this function is the previous value of the location being atomically updated; it is supplied for "information" or local use by the caller only. You don't assign it back to the location you are atomically updating; that completely defeats the purpose of using atomic memory access in the first place. Do something like this instead:
if (cacheIndex == 0) {
    double result = cuda_atomicAdd(dot_res, cache[0]);
}
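On a side note: on devices of compute capability 6.0 and newer, atomicAdd supports double natively, so a CAS loop like cuda_atomicAdd is only needed as a fallback for older hardware. A guard along these lines (sketch; atomicAddDouble is just an illustrative wrapper name) is common:
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 600
// pre-Pascal devices: fall back to the CAS-based implementation above
__device__ double atomicAddDouble(double *address, double val)
{
    return cuda_atomicAdd(address, val);
}
#else
// compute capability >= 6.0: hardware double-precision atomicAdd
__device__ double atomicAddDouble(double *address, double val)
{
    return atomicAdd(address, val);
}
#endif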

I didn't check your code in that much depth, but here is some advice.
I would only advise using Thrust if you use your GPU exclusively for such generic tasks, because if a more complex problem arises you will have no idea how to program the GPU efficiently in parallel yourself.
Launch a separate parallel reduction kernel to sum up the per-block dot products.
Since the data is already on the device you won't see a decrease in performance from starting a new kernel.
Your kernel doesn't seem to scale across the maximum number of possible blocks on the newest GPUs. If it did, and it had to calculate the dot product of millions of values, the performance would decrease dramatically because of the serialized atomic operation.
Beginner mistake: are your input data and shared memory accesses range-checked? Or are you sure the input data is always a multiple of your block size? Otherwise you will read garbage. Most of my wrong results were due to this fault.
Optimise your parallel reduction (see my thesis or Mark Harris's optimisation slides).
Untested, I just wrote it down in Notepad:
/*
 * @param inCount_s   unsigned long long int  Length of both input arrays
 * @param inValuesA_g double*                 First value array
 * @param inValuesB_g double*                 Second value array
 * @param outDots_g   double*                 Output dots of each block, length equals the number of blocks
 */
__global__ void dotProduct(const unsigned long long int inCount_s,
                           const double* inValuesA_g,
                           const double* inValuesB_g,
                           double* outDots_g)
{
    //get unique block index in a possible 3D Grid
    const unsigned long long int blockId = blockIdx.x //1D
            + blockIdx.y * gridDim.x //2D
            + gridDim.x * gridDim.y * blockIdx.z; //3D

    //block dimension uses only x-coordinate
    const unsigned long long int tId = blockId * blockDim.x + threadIdx.x;

    /*
     * shared value pair products array, where BLOCK_SIZE is a power of 2
     *
     * To improve performance, increase its size by a multiple of BLOCK_SIZE so that each thread loads more than 1 element!
     * (outDots_g length decreases by the same factor, and you need to range check and initialize memory)
     * -> see Harris' GPU optimisation / parallel reduction slides for more information.
     */
    __shared__ double dots_s[BLOCK_SIZE];

    /*
     * initialize shared memory array and calculate dot product of two values,
     * shared memory always needs to be initialized, it's never 0 by default, else garbage is read later!
     */
    if(tId < inCount_s)
        dots_s[threadIdx.x] = inValuesA_g[tId] * inValuesB_g[tId];
    else
        dots_s[threadIdx.x] = 0;
    __syncthreads();

    //do parallel reduction on shared memory array to sum up values
    reductionAdd(dots_s, dots_s[0]); //see my thesis link

    //output one partial dot product per block
    if(threadIdx.x == 0)
        outDots_g[blockId] = dots_s[0];

    //start new parallel reduction kernel to sum up outDots_g! (see the sketch below)
}
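For the second pass, a small kernel that sums the per-block results in outDots_g could look roughly like this (equally untested; it assumes it is launched with a single block of BLOCK_SIZE threads, BLOCK_SIZE being a power of 2):
__global__ void sumPartials(const unsigned long long int inCount_s,
                            const double* inPartials_g,
                            double* outSum_g)
{
    __shared__ double partials_s[BLOCK_SIZE];

    //each thread accumulates a strided subset of the partial block sums
    double sum = 0.0;
    for(unsigned long long int i = threadIdx.x; i < inCount_s; i += blockDim.x)
        sum += inPartials_g[i];
    partials_s[threadIdx.x] = sum;
    __syncthreads();

    //standard shared memory tree reduction
    for(unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if(threadIdx.x < s)
            partials_s[threadIdx.x] += partials_s[threadIdx.x + s];
        __syncthreads();
    }

    //thread 0 writes the final dot product
    if(threadIdx.x == 0)
        *outSum_g = partials_s[0];
}
It would be launched with a single block after the kernel above, e.g. sumPartials<<<1, BLOCK_SIZE>>>(numBlocks, outDots_g, d_result); where numBlocks is the grid size used for dotProduct and d_result is a device pointer for the final sum (both placeholder names).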
Edit: removed unnecessary points.


CUDA dot product gives wrong results

I wrote a dot product code with CUDA to compute the dot product of two double vectors. The kernel was invoked by N threads (N < 1024) in 1 block. But it can't give correct results. I can't figure it out.
__global__ void dotthread(double* a, double *b, double *sum, int N)
{
    int tid = threadIdx.x;
    sum[0] = sum[0] + a[tid] * b[tid]; //every thread writes to sum[0]
    __syncthreads();
}
Let's look at two of your three lines of code:
sum[0] = sum[0] + a[tid] * b[tid]; //every thread writes to sum[0]
__syncthreads();
The first line contains a memory race. Every thread in the block will simultaneously attempt to write to sum[0]. There is nothing in the CUDA execution model which can stop this from happening. There is no automatic serialization or memory protection which can stop this behaviour.
The second line is an execution barrier. This means that each warp of threads will be blocked until every warp of threads has reached the barrier. It has no effect whatsoever on prior instructions, and it does not make the conflicting memory transactions which your code issues correct or atomic.
The code you have written is irreversibly broken. The canonical way to perform this sort of operation would be via a parallel reduction. There are a number of different ways this can be done; it is probably the most described and documented parallel algorithm for GPUs. If you have installed the CUDA toolkit, you already have both a complete working example and a comprehensive paper describing the algorithm as it would be implemented using shared memory. I suggest you study it.
You can see an (almost) working implementation of a dot product using shared memory here which I recommend you study as well. You can also find optimal implementations of the parallel block reduction in libraries such as cub
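For instance, with cub the whole reduction collapses to a pair of calls once the element-wise products are in a device array (sketch; d_products, d_result and n are placeholder names):
#include <cub/cub.cuh>

// d_products holds a[i]*b[i] on the device, d_result is a single double on the device
void  *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
// first call only queries how much temporary storage the reduction needs
cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_products, d_result, n);
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// second call performs the actual sum
cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_products, d_result, n);
cudaFree(d_temp_storage);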
I wrote two versions of the dot product procedure. The first one uses the atomicAdd function, the second one allocates one shared variable for each block.
The computation times are 3.33 ms and 0.19 ms respectively, compared with 0.17 ms for the reduction dot product and 411.43 ms for a one-thread dot product.
GPU Device 0: "GeForce GTX 1070 Ti" with compute capability 6.1
2000000flow array allocated 2147483647
naive elapsedTimeIn Ms 411.437042 Ms
sum is 6.2831853071e+06
thread atomic add elapsedTimeIn Ms 3.3371520042 Ms
sum is 6.2831853071e+06
cache reduction elapsedTimeIn Ms 0.1764480025 Ms
sum is 6.2831853072e+06
cache atomicadd elapsedTimeIn Ms 0.1914239973 Ms
sum is 6.2831853072e+06
__global__ void dotatomicAdd(double* a, double *b, double *sum, int N)
{
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
    while (tid < N) {
        double t = a[tid] * b[tid];
        atomicAdd(sum, t);
        tid += blockDim.x * gridDim.x;
    }
}
__global__ void dotcache(double* a, double *b, double *c, int N)
{
    __shared__ double cache;
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int cacheIndex = threadIdx.x;
    double temp = 0.0;

    cache = 0.0;
    __syncthreads();

    while (tid < N) {
        temp += a[tid] * b[tid];
        tid += blockDim.x * gridDim.x;
    }
    atomicAdd(&cache, temp);
    __syncthreads();
    if (cacheIndex == 0) c[blockIdx.x] = cache;
}
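Note that dotcache only leaves one partial sum per block in c, so something still has to add those partials together. A minimal host-side finish might look like this (sketch; nBlocks is the grid size used for dotcache, d_c the device pointer passed as c, and h_c a temporary host buffer):
// copy the per-block partial sums back and add them up on the host
double *h_c = (double*)malloc(nBlocks * sizeof(double));
cudaMemcpy(h_c, d_c, nBlocks * sizeof(double), cudaMemcpyDeviceToHost);
double sum = 0.0;
for (int i = 0; i < nBlocks; i++)
    sum += h_c[i];
free(h_c);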

Can multiple blocks and threads write to the same output?

I have the following CUDA kernel code which computes the sum squared error of two arrays.
__global__ void kSquaredError(double* data, double* recon, double* error,
                              unsigned int num_elements)
{
    const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    for (unsigned int i = idx; i < num_elements; i += blockDim.x * gridDim.x) {
        *error += pow(data[i] - recon[i], 2);
    }
}
I need a single scalar output (error). In this case, it seems like all threads are writing to error simultaneously. Is there some way I need to synchronize it?
Currently I'm getting a bad result so I'm guessing there is some issue.
The implementation you have now is subject to race conditions, because all threads try to update the same global memory address at the same time. You could easily put an atomicAdd function instead of *error += pow..., but that suffers from performance issues since every update is serialized.
Instead you should try to do a reduction using shared memory, as follows:
__global__ void kSquaredError(double* data, double* recon, double* error, unsigned int num_elements) {
    const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int tid = threadIdx.x;
    extern __shared__ double serror[]; //temporary storage of each thread's error, sized at launch
    serror[tid] = 0.0; //shared memory is never zeroed by default
    for (unsigned int i = idx; i < num_elements; i += blockDim.x * gridDim.x) {
        serror[tid] += pow(data[i] - recon[i], 2); //put each thread's value in shared memory
    }
    __syncthreads();
    for (int i = blockDim.x >> 1; i > 0; i >>= 1) { //reduction in shared memory, halving the active threads each step
        if (tid < i) {
            serror[tid] += serror[tid + i];
        }
        __syncthreads(); //make sure all threads are at the same state and have written to shared memory
    }
    if (tid == 0) { //thread 0 updates the value in global memory
        atomicAdd(error, serror[tid]); // same as *error += serror[tid]; but atomic
    }
}
It works by the following principle: each thread has its own temporary variable where it accumulates the error over all of its input; when it has finished, all threads converge at the __syncthreads instruction to ensure that all data is complete.
Now half of the threads in the block will each take one value from the corresponding other half and add it to their own; halve the threads again and repeat until you are left with one thread (thread 0), which will have the total sum.
Now thread 0 will update global memory with an atomicAdd function to avoid a race condition with other blocks, if there are any.
If we just took the first example and used atomicAdd on every assignment, you would have one serialized atomic operation per element processed (num_elements in total); now we have only gridDim.x atomic operations, which is a lot fewer.
See Optimizing Parallel Reduction in CUDA for further reading on how reduction using cuda works.
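One practical detail: because the shared array size is only known at run time here, it has to be passed as the third launch configuration parameter, and the error accumulator must be zeroed beforehand. A possible launch (sketch; the block size and the d_* names are placeholders) would be:
// d_error should be zeroed first, e.g. with cudaMemset(d_error, 0, sizeof(double))
int threads = 256;
int blocks  = (num_elements + threads - 1) / threads;
kSquaredError<<<blocks, threads, threads * sizeof(double)>>>(d_data, d_recon, d_error, num_elements);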
Edit
Added if statement in the reduction for loop, forgot that.

CUDA efficient division?

I would like to know if there is, by any chance, an efficient way of dividing the elements of an array. I am running with matrix values of 10000x10000 and it takes a considerable amount of time compared with the other kernels. Division is an expensive operation, and I can't see how to improve it.
__global__ void division(int N, float* A, int* B){
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if((row < N) && (col <= row)){
        if( B[row*N+col] > 0 )
            A[row*N+col] /= (float)B[row*N+col];
    }
}
kernel launched with
int N = 10000;
int threads = 32;
int blocks = (N+threads-1)/threads;
dim3 t(threads,threads);
dim3 b(blocks, blocks);
division<<< b, t >>>(N, A, B);
cudaThreadSynchronize();
Option B:
__global__ void division(int N, float* A, int* B){
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    int kmax = N*(N+1)/2;
    int row, col;
    if(k < kmax){
        row = (int)(sqrt(0.25+2.0*k)-0.5);
        col = k - (row*(row+1))>>1;
        if( B[row*N+col] > 0 )
            A[row*N+col] /= (float)B[row*N+col];
    }
}
launched with
int threads = 192;
int totalThreadsNeeded = N*(N+1)/2;
int blocks = (threads + totalThreadsNeeded - 1)/threads;
division<<<blocks, threads>>>(N, A, B);
Why is option B giving a wrong result even if the threadIds are the correct one? what is missing here?
Your basic problem is that you are launching an improbably huge grid (over 100 million threads for your 10000x10000 array example), and then, because of the triangular nature of the access pattern in the kernel, fully half of those threads never do anything productive. So an enormous number of GPU cycles are being wasted for no particularly good reason. Further, the access pattern you are using doesn't allow coalesced memory access, which is going to further reduce the performance of the threads which are actually doing useful work.
If I understand your problem correctly, the kernel is only performing element-wise division on the lower triangle of a square array. If this is the case, it could equally be done using something like this:
__global__
void division(int N, float* A, int* B)
{
    for(int row=blockIdx.x; row<N; row+=gridDim.x) {
        for(int col=threadIdx.x; col<=row; col+=blockDim.x) {
            int val = max(1,B[row*N+col]);
            A[row*N+col] /= (float)val;
        }
    }
}
[disclaimer: written in browser, never compiled, never tested, use at own risk]
Here, a one-dimensional grid is used, with each block computing a row at a time. Threads in a block move along the row, so memory access is coalesced. In comments you mention your GPU is a Tesla C2050. That device only requires 112 blocks of 192 threads each to completely "fill" each of the 14 SMs with a full complement of 8 blocks and the maximum number of concurrent threads per SM. So the launch parameters could be something like:
int N = 10000;
int threads = 192;
int blocks = min(8*14, N);
division<<<blocks, threads>>>(N, A, B);
I would expect this to run considerably faster than your current approach. If numerical accuracy isn't that important, you can probably achieve further speed-up by replacing the division with an approximate reciprocal intrinsic and a floating point multiply.
Because threads are executed in groups of 32, called warps, you are paying for the division for all 32 threads in a warp if both if conditions are true for just one of the threads. If the condition is false for many threads, see if you can filter out the values for which the division is not needed in a separate kernel.
The int to float conversion may itself be slow. If so, you might be able to generate floats directly in your earlier step, and pass B in as an array of floats.
You may be able to generate inverted numbers in the earlier step, where you generate the B array. If so, you can use multiplication instead of division in this kernel. (a / b == a * 1 / b).
Depending on your algorithm, maybe you can get away with a lower precision division. There's an intrinsic, __fdividef(x, y), that you can try. There is also a compiler flag, -prec-div=false.
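For example, the division line in the kernel could be swapped for the intrinsic like this (sketch; only worthwhile if the reduced precision is acceptable for your data):
// __fdividef: faster, less accurate single-precision division
if (B[row*N+col] > 0)
    A[row*N+col] = __fdividef(A[row*N+col], (float)B[row*N+col]);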
The very first thing to look at should be coalesced memory access. There is no reason for the non-coalesced pattern here; just exchange rows and columns to avoid wasting a lot of memory bandwidth:
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
...
A[row*N+col] ...
Even if this is run on compute capability 2.0 or higher, the caches are not large enough to remedy this suboptimal pattern.

CUDA: bad performance with shared memory and no parallelism

I'm trying to exploit shared memory in this kernel function, but the performance is not as good as I was expecting. This function is called many times in my application (about 1000 times or more), so I was thinking of exploiting shared memory to avoid the memory latency. But something is apparently wrong, because my application became really slow since I started using shared memory.
This is the kernel:
__global__ void AndBitwiseOperation(int* _memory_device, int b1_size, int* b1_memory, int* b2_memory){
    int j = 0;
    // index GPU - Transaction-wise
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int tid = threadIdx.x;
    // shared variables
    extern __shared__ int shared_memory_data[];
    extern __shared__ int shared_b1_data[];
    extern __shared__ int shared_b2_data[];
    // copy from global memory into shared memory and sync threads
    shared_b1_data[tid] = b1_memory[tid];
    shared_b2_data[tid] = b2_memory[tid];
    __syncthreads();
    // AND each int bitwise
    for(j = 0; j < b1_size; j++)
        shared_memory_data[tid] = (shared_b1_data[tid] & shared_b2_data[tid]);
    // write result for this block to global memory
    _memory_device[i] = shared_memory_data[i];
}
The shared variables are declared extern because I don't know the size of b1 and b2, since they depend on the number of customers, which I can only know at runtime (but both have the same size all the time).
This is how I call the kernel:
void Bitmap::And(const Bitmap &b1, const Bitmap &b2)
{
    int* _memory_device;
    int* b1_memory;
    int* b2_memory;
    int b1_size = b1.getIntSize();
    // allocate memory on GPU
    (cudaMalloc((void **)&b1_memory, _memSizeInt * SIZE_UINT));
    (cudaMalloc((void **)&b2_memory, _memSizeInt * SIZE_UINT));
    (cudaMalloc((void **)&_memory_device, _memSizeInt * SIZE_UINT));
    // copy values on GPU
    (cudaMemcpy(b1_memory, b1._memory, _memSizeInt * SIZE_UINT, cudaMemcpyHostToDevice ));
    (cudaMemcpy(b2_memory, b2._memory, _memSizeInt * SIZE_UINT, cudaMemcpyHostToDevice ));
    (cudaMemcpy(_memory_device, _memory, _memSizeInt * SIZE_UINT, cudaMemcpyHostToDevice ));
    dim3 dimBlock(1, 1);
    dim3 dimGrid(1, 1);
    AndBitwiseOperation<<<dimGrid, dimBlock>>>(_memory_device, b1_size, b1_memory, b2_memory);
    // return values
    (cudaMemcpy(_memory, _memory_device, _memSizeInt * SIZE_UINT, cudaMemcpyDeviceToHost ));
    // Free Memory
    (cudaFree(b1_memory));
    (cudaFree(b2_memory));
    (cudaFree(_memory_device));
}
b1 and b2 are bitmaps with 4 bits for each element. The number of elements depends on the number of customers. Also, I have a problem with the kernel's parameters, because if I add some blocks or threads, AndBitwiseOperation() does not give me the correct result. With just 1 block and 1 thread per block the result is correct, but the kernel does not run in parallel.
Any advice is welcome :)
Thank you
I did not really understand what your kernel is supposed to do.
You should read more about CUDA and GPU programming.
I tried to point out some of the mistakes.
Shared memory (sm) should reduce global memory reads.
Analyze your global memory (gm) read and write operations per thread:
a. You read gm twice and write sm twice
b. (nonsense loop ignored, no use of the loop index) you read sm twice and write sm once
c. you read sm once and write gm once
So in total you have gained nothing. You could use global memory directly.
You use all threads to write out one value at the block index "i".
You should only use one thread to write this data out.
It makes no sense to output the same data with multiple threads, which will only get serialized.
You use a loop and don't use the loop counter at all.
You write at "tid" and read at "i" randomly.
This assignment is overhead:
unsigned int tid = threadIdx.x;
The results cannot be correct with more than one block, since with one block tid = i!
All the wrong indexing results in wrong calculations when using more than one block.
The shared memory at "i" was never written to:
_memory_device[i] = shared_memory_data[i];
My assumption of what your kernel should do:
/*
 * Call kernel with x-block usage and up to 3D grid
 */
__global__ void bitwiseAnd(int* outData_g,
                           const long long int inSize_s,
                           const int* inData1_g,
                           const int* inData2_g)
{
    //get unique block index
    const unsigned long long int blockId = blockIdx.x //1D
            + blockIdx.y * gridDim.x //2D
            + gridDim.x * gridDim.y * blockIdx.z; //3D

    //get unique thread index
    const unsigned long long int threadId = blockId * blockDim.x + threadIdx.x;

    //check global unique thread range
    if(threadId >= inSize_s)
        return;

    //output bitwise and
    outData_g[threadId] = inData1_g[threadId] & inData2_g[threadId];
}
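A matching launch for that kernel might look like this (sketch; BLOCK_SIZE, inSize and the d_* names are placeholders, and the grid only spills into the y dimension when x alone would exceed the 65535 per-dimension limit):
const int BLOCK_SIZE = 256;                      // placeholder block size
unsigned long long int nBlocks = (inSize + BLOCK_SIZE - 1) / BLOCK_SIZE;
unsigned int gridX = (nBlocks > 65535ULL) ? 65535u : (unsigned int)nBlocks;
unsigned int gridY = (unsigned int)((nBlocks + 65534ULL) / 65535ULL);
dim3 grid(gridX, gridY);                         // the kernel range-checks, so extra blocks are harmless
bitwiseAnd<<<grid, BLOCK_SIZE>>>(d_out, inSize, d_in1, d_in2);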
When you declare an extern __shared__ array, you must also specify its size in the kernel call.
The kernel configuration is:
<<< Dg, Db, Ns, S >>>
Ns is the size in bytes of the dynamically allocated extern __shared__ memory, and defaults to 0.
I don't think you can define more than one extern __shared__ array in your kernel. An example in the Programming Guide defines a single extern __shared__ array and manually sets arrays with offsets within it:
extern __shared__ float array[];
__device__ void func() // __device__ or __global__ function
{
    short* array0 = (short*)array;
    float* array1 = (float*)&array0[128];
    int* array2 = (int*)&array1[64];
}
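Applied to the kernel in the question, that would mean carving the two input buffers out of a single extern array and passing the combined size at launch, roughly like this (sketch, reusing the names from the question):
// inside the kernel: one dynamically sized array, manually partitioned
extern __shared__ int shared_mem[];
int* shared_b1_data = shared_mem;               // first blockDim.x ints
int* shared_b2_data = shared_mem + blockDim.x;  // next blockDim.x ints

// host side: pass the total byte count as the third launch parameter
AndBitwiseOperation<<<dimGrid, dimBlock, 2 * dimBlock.x * sizeof(int)>>>(
        _memory_device, b1_size, b1_memory, b2_memory);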

CUDA: Shared memory over a large-ish 2D array

I had a simple CUDA problem for a class assignment, but the professor added an optional task to implement the same algorithm using shared memory instead. I was unable to finish it before the deadline (as in, the turn-in date was a week ago), but I'm still curious, so now I'm going to ask the internet ;).
The basic assignment was to implement a bastardized version of a red-black successive over-relaxation both sequentially and in CUDA, make sure you got the same result in both, and then compare the speedup. Like I said, doing it with shared memory was an optional +10% add-on.
I'm going to post my working version and pseudocode for what I've attempted, since I don't have the code in my hands at the moment, but I can update this later with the actual code if someone needs it.
Before anyone says it: Yes, I know using CUtil is lame, but it made the comparison and timers easier.
Working global memory version:
#include <stdlib.h>
#include <stdio.h>
#include <cutil_inline.h>
#define N 1024
__global__ void kernel(int *d_A, int *d_B) {
    unsigned int index_x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int index_y = blockIdx.y * blockDim.y + threadIdx.y;

    // map the two 2D indices to a single linear, 1D index
    unsigned int grid_width = gridDim.x * blockDim.x;
    unsigned int index = index_y * grid_width + index_x;

    // check for boundaries and write out the result
    if((index_x > 0) && (index_y > 0) && (index_x < N-1) && (index_y < N-1))
        d_B[index] = (d_A[index-1]+d_A[index+1]+d_A[index+N]+d_A[index-N])/4;
}

int main(int argc, char **argv) {
    int A[N][N], B[N][N];
    int *d_A, *d_B; // These are the copies of A and B on the GPU
    int *h_B; // This is a host copy of the output of B from the GPU
    int i, j;
    int num_bytes = N * N * sizeof(int);

    // Input is randomly generated
    for(i=0;i<N;i++) {
        for(j=0;j<N;j++) {
            A[i][j] = rand()/1795831;
            //printf("%d\n",A[i][j]);
        }
    }

    cudaEvent_t start_event0, stop_event0;
    float elapsed_time0;
    CUDA_SAFE_CALL( cudaEventCreate(&start_event0) );
    CUDA_SAFE_CALL( cudaEventCreate(&stop_event0) );
    cudaEventRecord(start_event0, 0);

    // sequential implementation of main computation
    for(i=1;i<N-1;i++) {
        for(j=1;j<N-1;j++) {
            B[i][j] = (A[i-1][j]+A[i+1][j]+A[i][j-1]+A[i][j+1])/4;
        }
    }

    cudaEventRecord(stop_event0, 0);
    cudaEventSynchronize(stop_event0);
    CUDA_SAFE_CALL( cudaEventElapsedTime(&elapsed_time0, start_event0, stop_event0) );

    h_B = (int *)malloc(num_bytes);
    memset(h_B, 0, num_bytes);

    //ALLOCATE MEMORY FOR GPU COPIES OF A AND B
    cudaMalloc((void**)&d_A, num_bytes);
    cudaMalloc((void**)&d_B, num_bytes);
    cudaMemset(d_A, 0, num_bytes);
    cudaMemset(d_B, 0, num_bytes);

    //COPY A TO GPU
    cudaMemcpy(d_A, A, num_bytes, cudaMemcpyHostToDevice);

    // create CUDA event handles for timing purposes
    cudaEvent_t start_event, stop_event;
    float elapsed_time;
    CUDA_SAFE_CALL( cudaEventCreate(&start_event) );
    CUDA_SAFE_CALL( cudaEventCreate(&stop_event) );
    cudaEventRecord(start_event, 0);

    // TODO: CREATE BLOCKS AND THREADS AND INVOKE GPU KERNEL
    dim3 block_size(256,1,1); //values experimentally determined to be fastest
    dim3 grid_size;
    grid_size.x = N / block_size.x;
    grid_size.y = N / block_size.y;
    kernel<<<grid_size,block_size>>>(d_A,d_B);

    cudaEventRecord(stop_event, 0);
    cudaEventSynchronize(stop_event);
    CUDA_SAFE_CALL( cudaEventElapsedTime(&elapsed_time, start_event, stop_event) );

    //COPY B BACK FROM GPU
    cudaMemcpy(h_B, d_B, num_bytes, cudaMemcpyDeviceToHost);

    // Verify result is correct
    CUTBoolean res = cutComparei( (int *)B, (int *)h_B, N*N);
    printf("Test %s\n",(1 == res)?"Passed":"Failed");
    printf("Elapsed Time for Sequential: \t%.2f ms\n", elapsed_time0);
    printf("Elapsed Time for CUDA:\t%.2f ms\n", elapsed_time);
    printf("CUDA Speedup:\t%.2fx\n",(elapsed_time0/elapsed_time));

    cudaFree(d_A);
    cudaFree(d_B);
    free(h_B);
    cutilDeviceReset();
}
For the shared memory version, this is what I've tried so far:
#define N 1024

__global__ void kernel(int *d_A, int *d_B, int width) {
    //assuming width is 64 because that's the biggest number I can make it
    //each MP has 48KB of shared mem, which is 12K ints, 32 threads/warp, so max 375 ints/thread?
    __shared__ int A_sh[3][66];

    //get x and y index and turn it into linear index

    for(i=0; i < width+2; i++) //have to load 2 extra values due to the -1 and +1 in algo
        A_sh[index_y%3][i] = d_A[index+i-1]; //so A_sh[index_y%3][0] is actually d_A[index-1]
    __syncthreads(); //and hope that previous and next row have been loaded by other threads in the block?

    //ignore boundary conditions because it's pseudocode
    for(i=0; i < width; i++)
        d_B[index+i] = A_sh[index_y%3][i] + A_sh[index_y%3][i+2] + A_sh[index_y%3-1][i+1] + A_sh[index_y%3+1][i+1];
}

main(){
    //same init as above until threads/grid init
    dim3 threadsperblk(32,16);
    dim3 numblks(32,64);
    kernel<<<numblks,threadsperblk>>>(d_A,d_B,64);
    //rest is the same
}
This shared mem code crashes ("launch failed due to unspecified error") since I haven't caught all the boundary conditions yet, but I'm not worried about that as much as finding the correct way to get things going. I feel that my code is way too complicated to be the correct path (especially compared to the SDK examples), but I also can't see another way to do it, since my array doesn't fit into shared mem like in all the examples I can find.
And frankly, I'm not sure it would be that much faster on my hardware (GTX 560 Ti - runs the global memory version in 0.121ms), but I need to prove it to myself first :P
Edit 2: For anyone who runs across this in the future, the code in the answer is a good starting point if you want to do some shared memory.
The key to getting the maximum out of these sorts of stencil operators in CUDA is data re-use. I have found that the best approach is usually to have each block "walk" through a dimension of the grid. After the block has loaded an initial tile of data into shared memory, only a single dimension (so a row, in a row-major-order 2D problem) needs to be read from global memory to have the necessary data in shared memory for the second and subsequent row calculations. The rest of the data can just be reused. To visualise how the shared memory buffer looks through the first four steps of this sort of algorithm:
Three "rows" (a,b,c) of the input grid are loaded to shared memory, and the stencil computed for row (b) and written to global memory
aaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbb
cccccccccccccccc
Another row (d) is loaded into the shared memory buffer, replacing row (a), and the calculations made for row (c) using a different stencil, reflecting where the row data is in shared memory
dddddddddddddddd
bbbbbbbbbbbbbbbb
cccccccccccccccc
Another row (e) is loaded into the shared memory buffer, replacing row (b), and the calculations made for row (d), using a different stencil from either step 1 or 2.
dddddddddddddddd
eeeeeeeeeeeeeeee
cccccccccccccccc
Another row (f) is loaded into the shared memory buffer, replacing row (c), and the calculations made for row (e). Now the data is back to the same layout as used in step 1, and the same stencil used in step 1 can be used.
dddddddddddddddd
eeeeeeeeeeeeeeee
ffffffffffffffff
The whole cycle repeats until the block has traversed the full column length of the input grid. The reason for using different stencils rather than shifting the data in the shared memory buffer comes down to performance: shared memory only has about 1000 Gb/s bandwidth on Fermi, and the shifting of data would become a bottleneck in fully optimal code. You should try different buffer sizes, because you might find smaller buffers allow for higher occupancy and improved kernel throughput.
EDIT: To give a concrete example of how that might be implemented:
template<int width>
__device__ void rowfetch(int *in, volatile int *out, int col)
{
    *out = *in;
    if (col == 1) *(out-1) = *(in-1);
    if (col == width) *(out+1) = *(in+1);
}

template<int width>
__global__ void stencilkernel(int *in, int *out, int nrows, unsigned int lda)
{
    // shared buffer holds three rows x (width+2) cols(threads)
    __shared__ volatile int buffer [3][2+width];

    int colid = threadIdx.x + blockIdx.x * blockDim.x;
    int tid = threadIdx.x + 1;

    int * rowpos = &in[colid], * outpos = &out[colid];

    // load the first three rows (compiler will unroll loop)
    for(int i=0; i<3; i++, rowpos+=lda) {
        rowfetch<width>(rowpos, &buffer[i][tid], tid);
    }
    __syncthreads(); // shared memory loaded and all threads ready

    int brow = 0; // brow is the next buffer row to load data onto
    for(int i=0; i<nrows; i++, rowpos+=lda, outpos+=lda) {

        // Do stencil calculations - use the value of brow to determine which
        // stencil to use
        int result = 0; // placeholder for the stencil computation

        // write result to outpos
        *outpos = result;

        // Fetch another row
        __syncthreads(); // Wait until all threads are done calculating
        rowfetch<width>(rowpos, &buffer[brow][tid], tid);
        brow = (brow < 2) ? (brow+1) : 0; // Increment or roll brow over
        __syncthreads(); // Wait until all threads have updated the buffer
    }
}
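A launch for that sketch would pair the block size with the template parameter, something like this (placeholder values; ncols would be the row length and d_in/d_out the device arrays):
// blockDim.x must equal the template parameter 'width'
const int width = 256;
int nblocks = (ncols + width - 1) / width;
stencilkernel<width><<<nblocks, width>>>(d_in, d_out, nrows, lda);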