I have two cudaArrays, a1 and a2, of the same size, which represent two matrices.
Using texture memory, I want to multiply those two cudaArrays.
Then I want to copy the result back into a normal array; let's call it *a1_h.
The problem is that I just don't know how to do it. I've managed to define and allocate my two cudaArrays and to put floats into them.
Now I want to write a kernel that does those multiplications.
Can somebody help me?
ROOM_X and ROOM_Y are ints; they define the width and height of the matrices.
mytex_M1 and mytex_M2 are textures defined as: texture<float, 2, cudaReadModeElementType>.
Here is my main:
int main(int argc, char * argv[]) {
int size = ROOM_X * ROOM_Y * sizeof(float);
//creation of arrays on host.Will be useful for filling the cudaArrays
float *M1_h, *M2_h;
//allocating memories on Host
M1_h = (float *)malloc(size);
M2_h = (float *)malloc(size);
//creation of channel descriptions for 2d texture
cudaChannelFormatDesc channelDesc_M1 = cudaCreateChannelDesc<float>();
cudaChannelFormatDesc channelDesc_M2 = cudaCreateChannelDesc<float>();
//creation of 2 cudaArray * .
cudaArray *M1_array,*M2_array;
//bind arrays and channel in order to allocate space
cudaMallocArray(&M1_array,&channelDesc_M1,ROOM_X,ROOM_Y);
cudaMallocArray(&M2_array,&channelDesc_M2,ROOM_X,ROOM_Y);
//filling the matrices on host
Matrix(M1_h);
Matrix(M2_h);
//copy from host to device (putting the initial values of M1 and M2 into the arrays)
cudaMemcpyToArray(M1_array, 0, 0,M1_h, size,cudaMemcpyHostToDevice);
cudaMemcpyToArray(M2_array, 0, 0,M2_h, size,cudaMemcpyHostToDevice);
//set textures parameters
mytex_M1.addressMode[0] = cudaAddressModeWrap;
mytex_M1.addressMode[1] = cudaAddressModeWrap;
mytex_M1.filterMode = cudaFilterModeLinear;
mytex_M1.normalized = true; //NB coordinates in [0,1]
mytex_M2.addressMode[0] = cudaAddressModeWrap;
mytex_M2.addressMode[1] = cudaAddressModeWrap;
mytex_M2.filterMode = cudaFilterModeLinear;
mytex_M2.normalized = true; //NB coordinates in [0,1]
//bind arrays to the textures
cudaBindTextureToArray(mytex_M1,M1_array);
cudaBindTextureToArray(mytex_M2,M2_array);
//allocate device memory for result
float* M1_d;
cudaMalloc( (void**)&M1_d, size);
//dimensions of grid and blocks
dim3 dimGrid(ROOM_X,ROOM_Y);
dim3 dimBlock(1,1);
//execution of the kernel . The result of the multiplication has to be put in M1_d
mul_texture<<<dimGrid, dimBlock >>>(M1_d);
//copy result from device to host
cudaMemcpy(M1_h,M1_d, size, cudaMemcpyDeviceToHost);
//free memory on device
cudaFreeArray(M1_array);
cudaFreeArray(M2_array);
cudaFree(M1_d);
//free memory on host
free(M1_h);
free(M2_h);
return 0;
}
When you declare a texture reference, note this restriction from the programming guide:
A texture reference can only be declared as a static global variable and cannot be passed as an argument to a function.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/#texture-reference-api
So, if you have successfully defined the texture references, initialized the arrays, copied them to the texture space, and prepared the output buffers (which your code appears to do), what remains is to implement the kernel. For example:
__global__ void
mul_texture(float* M1_d, int w, int h)
{
// map from threadIdx/BlockIdx to pixel position
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
// take care of the size of the image, it's a good practice
if ( x < w && y < h )
{
// the output M1_d is actually represented as a 1D array,
// so the offset of each value is derived from its (x,y) position
// in row-major order
int gid = x + y * w;
// As textures are declared at global scope,
// we can access their content from any kernel.
// The textures in the question use normalized coordinates, so map
// the texel center (x + 0.5, y + 0.5) into [0,1]
float M1_value = tex2D(mytex_M1, (x + 0.5f) / w, (y + 0.5f) / h);
float M2_value = tex2D(mytex_M2, (x + 0.5f) / w, (y + 0.5f) / h);
// The final result is the pointwise multiplication
M1_d[ gid ] = M1_value * M2_value;
}
}
You need to change the kernel invocation to include the w and h values, corresponding to the width (number of columns in the matrix) and height (number of rows of the matrix).
mul_texture<<<dimGrid, dimBlock >>>(M1_d, ROOM_X, ROOM_Y);
Note that you are not doing any error checking, which would help you quite a lot now and in the future. I have not checked whether the kernel provided in this answer works, as your code didn't compile.
I'm trying to understand the idea of the leading dimension in cuBLAS. It's mentioned that lda must always be greater than or equal to the # of rows in a matrix.
If I have a 100x100 matrix A and I wanted to access A(90:99, 0:99), what would be the arguments of cublasSetMatrix? lda specifies the number of rows between the elements in the same column(100 in this case), but where would I specify the 90? I can only see a way by adjusting *A.
The function definition is:
cublasStatus_t cublasSetMatrix(int rows, int cols, int elemSize, const void *A, int lda, void *B, int ldb)
And I'm also guessing that I wouldn't be able to transfer the bottom right 3x3 portion of a 5x5 matrix given the length limits.
You have to "adjust *A", as you called it. The pointer that is given to this function must be the starting entry of the respective sub-matrix.
You did not say whether your matrix A is actually the input- or the output matrix, but this should not change much, conceptually.
Assuming you have the following code:
// The matrix in host memory
int rowsA = 100;
int colsA = 100;
float *A = new float[rowsA*colsA];
// Fill A with values
...
// The sub-matrix that should be copied to the device.
// The minimum index is INCLUSIVE
// The maximum index is EXCLUSIVE
int minRowA = 0;
int maxRowA = 100;
int minColA = 90;
int maxColA = 100;
int rowsB = maxRowA-minRowA;
int colsB = maxColA-minColA;
// Allocate the device matrix
float *dB = nullptr;
cudaMalloc(&dB, rowsB * colsB * sizeof(float));
Then, for the cublasSetMatrix call, you have to compute the starting element of the source matrix:
float *sourceA = A + (minRowA + minColA * rowsA);
cublasSetMatrix(rowsB, colsB, sizeof(float), sourceA, rowsA, dB, rowsB);
And this is where the 90 that you asked for comes into play: It is the minColA in the computation of the source pointer.
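To illustrate the pointer adjustment without needing a GPU, here is a minimal host-only C++ sketch that mimics the copy pattern of cublasSetMatrix for column-major data. The helper copyBlockColumnMajor is a hypothetical stand-in written for this example, not a cuBLAS function, and the 5x5 test matrix is made up. It extracts the bottom-right 3x3 block of a 5x5 matrix, which also shows that such a transfer is possible with the same arguments.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical host-only stand-in for the copy pattern of cublasSetMatrix:
// copies a rows x cols column-major block from src (leading dimension lda)
// to dst (leading dimension ldb), one column at a time.
void copyBlockColumnMajor(int rows, int cols, std::size_t elemSize,
                          const void *src, int lda, void *dst, int ldb)
{
    const char *s = static_cast<const char *>(src);
    char *d = static_cast<char *>(dst);
    for (int c = 0; c < cols; ++c) {
        std::memcpy(d + static_cast<std::size_t>(c) * ldb * elemSize,
                    s + static_cast<std::size_t>(c) * lda * elemSize,
                    static_cast<std::size_t>(rows) * elemSize);
    }
}

// Copies the bottom-right 3x3 block of a 5x5 column-major matrix whose
// entry (r, c) holds the value 10 * r + c.
std::vector<float> bottomRight3x3()
{
    const int rowsA = 5, colsA = 5;
    std::vector<float> A(rowsA * colsA);
    for (int c = 0; c < colsA; ++c)
        for (int r = 0; r < rowsA; ++r)
            A[r + c * rowsA] = 10.0f * r + c;

    const int minRow = 2, minCol = 2;   // inclusive start of the block
    const int rowsB = 3, colsB = 3;
    std::vector<float> B(rowsB * colsB);

    // Same pointer arithmetic as in the answer: skip minCol full columns
    // of the source, then minRow entries within that column.
    const float *sourceA = A.data() + (minRow + minCol * rowsA);
    copyBlockColumnMajor(rowsB, colsB, sizeof(float),
                         sourceA, rowsA, B.data(), rowsB);
    return B;
}
```

Calling bottomRight3x3() returns the block in column-major order, so its first entry is the source element at row 2, column 2. The source leading dimension stays at the full row count of A (5 here); only the pointer moves.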
Is there a way to read the values in a cudaArray from the device without wrapping it in a texture reference/object? All of the examples I've looked at use cudaArrays exclusively for creating textures. Is that the only way they can be used, or could I do something like:
__global__ kernel(cudaArray *arr, ...) {
float x = tex1D<float>(arr, ...);
...
}
cudaArray *arr;
cudaMallocArray(&arr, ...);
cudaMemcpyToArray(arr, ...);
kernel<<<...>>>(arr, ...);
So basically, what should go in place of tex1D there? Also, if this is possible I'd be curious if anyone thinks there would be any performance benefit to doing this, but I'll also be running my own tests to see.
Thanks!
cudaArray is defined for texturing or surface memory purposes. As the CUDA programming guide states:
CUDA arrays are opaque memory layouts optimized for texture fetching. They are one dimensional, two dimensional, or three-dimensional and composed of elements, each of which has 1, 2 or 4 components that may be signed or unsigned 8, 16 or 32 bit integers, 16 bit floats, or 32 bit floats. CUDA arrays are only accessible by kernels through texture fetching as described in Texture Memory or surface reading and writing as described in Surface Memory.
So in effect you have to use either texture functions or surface functions in kernels to access data in a cudaArray.
There are several performance benefit possibilities associated with using texturing. Texturing can imply interpolation (i.e. reading from a texture using floating point coordinates). Any application that needs this kind of data interpolation may benefit from the HW interpolation engines inside the texture units on the GPU.
Another benefit, perhaps the most important for using texturing in arbitrary GPU codes, is the texture cache that backs up the textures stored in global memory. Texturing is a read-only operation, but if you have an array of read-only data, the texture cache may improve or otherwise extend your ability to access data rapidly. This generally implies that there must be data-locality/ data-reuse in your functions that are accessing data stored in the texturing mechanism. Texture data retrieved will not disrupt anything in the L1 cache, so generally this kind of data segmentation/optimization would be part of a larger strategy around data caching. If there were no other demands on L1 cache, the texture mechanism/cache does not provide faster access to data than if it were in the L1 already.
Robert Crovella has already answered your question. I believe it could be useful for future readers to have a worked example of the two solutions: textures and surfaces.
#include <stdio.h>
#include <cstdlib>
#include <thrust/host_vector.h>
// --- 2D float texture
texture<float, cudaTextureType2D, cudaReadModeElementType> texRef;
// --- 2D surface memory
surface<void, 2> surf2D;
/********************/
/* CUDA ERROR CHECK */
/********************/
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
/*************************************/
/* cudaArray PRINTOUT TEXTURE KERNEL */
/*************************************/
__global__ void cudaArrayPrintoutTexture(int width, int height)
{
unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
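// With normalized coordinates, texel (x, y) is centered at ((x + 0.5) / width, (y + 0.5) / height)
printf("Thread index: (%i, %i); cudaArray = %f\n", x, y, tex2D(texRef, (x + 0.5f) / (float)width, (y + 0.5f) / (float)height));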
}
/*************************************/
/* cudaArray PRINTOUT SURFACE KERNEL */
/*************************************/
__global__ void cudaArrayPrintoutSurface(int width, int height)
{
unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
float temp;
surf2Dread(&temp, surf2D, x * 4, y);
printf("Thread index: (%i, %i); cudaArray = %f\n", x, y, temp);
}
/********/
/* MAIN */
/********/
int main()
{
int width = 3, height = 3;
thrust::host_vector<float> h_data(width*height, 3.f);
// --- Allocate CUDA array in device memory
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
cudaArray* cuArray;
/*******************/
/* TEXTURE BINDING */
/*******************/
gpuErrchk(cudaMallocArray(&cuArray, &channelDesc, width, height));
// --- Copy host data to device memory
gpuErrchk(cudaMemcpyToArray(cuArray, 0, 0, thrust::raw_pointer_cast(h_data.data()), width*height*sizeof(float), cudaMemcpyHostToDevice));
// --- Set texture parameters
texRef.addressMode[0] = cudaAddressModeWrap;
texRef.addressMode[1] = cudaAddressModeWrap;
texRef.filterMode = cudaFilterModeLinear;
texRef.normalized = true;
// --- Bind the array to the texture reference
gpuErrchk(cudaBindTextureToArray(texRef, cuArray, channelDesc));
// --- Invoking printout kernel
dim3 dimBlock(3, 3);
dim3 dimGrid(1, 1);
cudaArrayPrintoutTexture<<<dimGrid, dimBlock>>>(width, height);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
gpuErrchk(cudaUnbindTexture(texRef));
gpuErrchk(cudaFreeArray(cuArray));
/******************/
/* SURFACE MEMORY */
/******************/
gpuErrchk(cudaMallocArray(&cuArray, &channelDesc, width, height, cudaArraySurfaceLoadStore));
// --- Copy host data to device memory
gpuErrchk(cudaMemcpyToArray(cuArray, 0, 0, thrust::raw_pointer_cast(h_data.data()), width*height*sizeof(float), cudaMemcpyHostToDevice));
gpuErrchk(cudaBindSurfaceToArray(surf2D, cuArray));
cudaArrayPrintoutSurface<<<dimGrid, dimBlock>>>(width, height);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
gpuErrchk(cudaFreeArray(cuArray));
}
I'm trying to calculate the fft of an image using CUFFT. It seems like CUFFT only offers fft of plain device pointers allocated with cudaMalloc.
My input images are allocated using cudaMallocPitch but there is no option for handling pitch of the image pointer.
Currently, I have to remove the alignment of rows, then execute the fft, and copy back the results to the pitched pointer. My current code is as follows:
void fft_device(float* src, cufftComplex* dst, int width, int height, int srcPitch, int dstPitch)
{
//src and dst are device pointers allocated with cudaMallocPitch
//Convert them to plain pointers. No padding of rows.
float *plainSrc;
cufftComplex *plainDst;
cudaMalloc<float>(&plainSrc,width * height * sizeof(float));
cudaMalloc<cufftComplex>(&plainDst,width * height * sizeof(cufftComplex));
cudaMemcpy2D(plainSrc,width * sizeof(float),src,srcPitch,width * sizeof(float),height,cudaMemcpyDeviceToDevice);
cufftHandle handle;
cufftPlan2d(&handle,width,height,CUFFT_R2C);
cufftSetCompatibilityMode(handle,CUFFT_COMPATIBILITY_NATIVE);
cufftExecR2C(handle,plainSrc,plainDst);
cufftDestroy(handle);
cudaMemcpy2D(dst,dstPitch,plainDst,width * sizeof(cufftComplex),width * sizeof(cufftComplex),height,cudaMemcpyDeviceToDevice);
cudaFree(plainSrc);
cudaFree(plainDst);
}
It gives the correct result, but I don't want to do 2 extra memory allocations and copies inside the function. I want to do something like this:
void fft_device(float* src, cufftComplex* dst, int width, int height, int srcPitch, int dstPitch)
{
//src and dst are device pointers allocated with cudaMallocPitch
//Don't know how to handle pitch here???
cufftHandle handle;
cufftPlan2d(&handle,width,height,CUFFT_R2C);
cufftSetCompatibilityMode(handle,CUFFT_COMPATIBILITY_NATIVE);
cufftExecR2C(handle,src,dst);
cufftDestroy(handle);
}
Question:
How can I calculate the FFT of a pitched pointer directly using CUFFT?
I think you may be interested in cufftPlanMany, which lets you do 1D, 2D, and 3D FFTs with pitches. The key here is the inembed and onembed parameters.
You can look up CUDA_CUFFT_Users_Guide.pdf (Pages 23-24) for more information. But for your example, you'd be doing something like the following.
void fft_device(float* src, cufftComplex* dst,
int width, int height,
int srcPitch, int dstPitch)
{
cufftHandle handle;
int rank = 2; // 2D fft
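int n[] = {height, width}; // Size of the Fourier transform, slowest-varying dimension first
int istride = 1, ostride = 1; // Stride lengths
int idist = 1, odist = 1; // Distance between batches (not used when batch == 1)
// inembed and onembed are counted in elements, not bytes, so the byte
// pitches returned by cudaMallocPitch (assumed to be what srcPitch and
// dstPitch hold) are converted here
int inembed[] = {height, srcPitch / (int)sizeof(float)}; // Input size with pitch
int onembed[] = {height, dstPitch / (int)sizeof(cufftComplex)}; // Output size with pitch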
int batch = 1;
cufftPlanMany(&handle, rank, n,
inembed, istride, idist,
onembed, ostride, odist, CUFFT_R2C, batch);
cufftSetCompatibilityMode(handle,CUFFT_COMPATIBILITY_NATIVE);
cufftExecR2C(handle,src,dst);
cufftDestroy(handle);
}
P.S. I did not add return checks for the sake of example here. Always check for return values in your code.
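To make the role of inembed and onembed more concrete, here is a small host-only C++ sketch of how cuFFT's advanced data layout addresses a 2D input. It follows the indexing formula given in the cuFFT documentation, input[b * idist + (x * inembed[1] + y) * istride]. The helper names advancedLayoutOffset2D and pitchInElements are made up for illustration and are not cuFFT functions. The sketch also assumes that the byte pitch is a whole multiple of the element size.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: converts a byte pitch (as returned by cudaMallocPitch)
// into a pitch counted in elements. Assumes pitchBytes is a multiple of
// elemSize.
int pitchInElements(std::size_t pitchBytes, std::size_t elemSize)
{
    return static_cast<int>(pitchBytes / elemSize);
}

// Hypothetical helper: linear element offset of sample (row, col) of batch b
// for a rank-2 transform in the advanced data layout. nembed1 is the
// second entry of inembed/onembed, i.e. the row pitch in elements.
std::size_t advancedLayoutOffset2D(int b, int row, int col,
                                   int nembed1, int stride, int dist)
{
    return static_cast<std::size_t>(b) * dist
         + (static_cast<std::size_t>(row) * nembed1 + col) * stride;
}
```

For example, with a 512-byte pitch on a float image the row pitch is 128 elements, so sample (row 2, col 3) of the first batch sits at element offset 2 * 128 + 3 = 259. Note that only the second embed entry takes part in the address computation, which is why the pitch has to go there.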
I try to read values from a texture and write them back to global memory.
I am sure the writing part works, because I can put constant values in the kernel and I can see them in the output:
__global__ void
bartureKernel( float* g_odata, int width, int height)
{
unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
if(x < width && y < height) {
unsigned int idx = (y*width + x);
g_odata[idx] = tex2D(texGrad, (float)x, (float)y).x;
}
}
The texture I want to use is a 2D float texture with two channels, so I defined it as:
texture<float2, 2, cudaReadModeElementType> texGrad;
And the code which calls the kernel initializes the texture with some constant non-zero values:
float* d_data_grad = NULL;
cudaMalloc((void**) &d_data_grad, gradientSize * sizeof(float));
CHECK_CUDA_ERROR;
texGrad.addressMode[0] = cudaAddressModeClamp;
texGrad.addressMode[1] = cudaAddressModeClamp;
texGrad.filterMode = cudaFilterModeLinear;
texGrad.normalized = false;
cudaMemset(d_data_grad, 50, gradientSize * sizeof(float));
CHECK_CUDA_ERROR;
cudaBindTexture(NULL, texGrad, d_data_grad, cudaCreateChannelDesc<float2>(), gradientSize * sizeof(float));
float* d_data_barture = NULL;
cudaMalloc((void**) &d_data_barture, outputSize * sizeof(float));
CHECK_CUDA_ERROR;
dim3 dimBlock(8, 8, 1);
dim3 dimGrid( ((width-1) / dimBlock.x)+1, ((height-1) / dimBlock.y)+1, 1);
bartureKernel<<< dimGrid, dimBlock, 0 >>>( d_data_barture, width, height);
I know, setting the texture bytes to all "50" doesn't make much sense in the context of floats, but it should at least give me some non-zero values to read.
I can only read zeros though...
You are using cudaBindTexture to bind your texture to memory allocated by cudaMalloc, but in the kernel you read from the texture with tex2D. That mismatch is why it reads zeros.
A texture bound to linear memory with cudaBindTexture must be read with tex1Dfetch inside the kernel.
tex2D can only read from textures that are bound to pitch linear memory (allocated by cudaMallocPitch) using cudaBindTexture2D, or from textures bound to a cudaArray using cudaBindTextureToArray.
Here is the basic table; the rest you can read in the programming guide:
Memory Type          Allocated Using     Bound Using              Read In The Kernel By
Linear Memory        cudaMalloc          cudaBindTexture          tex1Dfetch
Pitch Linear Memory  cudaMallocPitch     cudaBindTexture2D        tex2D
cudaArray            cudaMallocArray     cudaBindTextureToArray   tex1D or tex2D
3D cudaArray         cudaMalloc3DArray   cudaBindTextureToArray   tex3D
To add on, access using tex1Dfetch is based on integer indexing.
The other fetch functions take floating-point coordinates, and with unnormalized coordinates you have to add 0.5 to the integer texel index to get the exact value you want.
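The reason for the 0.5 offset is easier to see with a small model of linear filtering. The following host-only C++ sketch imitates a 1D fetch with unnormalized coordinates and clamp addressing, following the formula in the texture fetching appendix of the programming guide (shift the coordinate by 0.5, then blend the two neighbouring texels). The function linearFetch1D is a hypothetical name for this example, and the sketch uses full float precision for the blend weight, whereas the hardware uses a low-precision fixed-point weight.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical host-only model of a 1D linear-filtered texture fetch with
// unnormalized coordinates and clamp addressing. Assumes texels is not empty.
float linearFetch1D(const std::vector<float> &texels, float x)
{
    const int n = static_cast<int>(texels.size());
    const float xb = x - 0.5f;          // shift so texel centers land on integers
    const float fl = std::floor(xb);
    const float alpha = xb - fl;        // blend weight between the two neighbours
    const int i0 = static_cast<int>(fl);
    const int i1 = i0 + 1;
    // Clamp addressing: out-of-range indices use the nearest edge texel
    auto clampIdx = [n](int i) { return i < 0 ? 0 : (i >= n ? n - 1 : i); };
    return (1.0f - alpha) * texels[clampIdx(i0)] + alpha * texels[clampIdx(i1)];
}
```

With texels {10, 20, 30, 40}, fetching at 1.5 returns texel 1 unchanged (20), while fetching at the integer coordinate 1.0 lands halfway between texels 0 and 1 and returns their average (15).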
I'm curious why you allocate float data and bind it to a float2 texture. It may give ambiguous results.
float2 does not mean a 2D float texture; it is a two-component vector type, which can be used, for example, to represent a complex number.
typedef struct {float x; float y;} float2;
I think this tutorial will help you understand how to use texture memory in cuda.
http://www.drdobbs.com/parallel/cuda-supercomputing-for-the-masses-part/218100902
The kernel you have shown does not benefit much from using a texture. However, if used properly, by exploiting locality, texture memory can improve performance considerably. It is also useful for interpolation.
I had a simple CUDA problem for a class assignment, but the professor added an optional task to implement the same algorithm using shared memory instead. I was unable to finish it before the deadline (as in, the turn-in date was a week ago) but I'm still curious so now I'm going to ask the internet ;).
The basic assignment was to implement a bastardized version of a red-black successive over-relaxation both sequentially and in CUDA, make sure you got the same result in both and then compare the speedup. Like I said, doing it with shared memory was an optional +10% add-on.
I'm going to post my working version and pseudocode what I've attempted to do since I don't have the code in my hands at the moment, but I can update this later with the actual code if someone needs it.
Before anyone says it: Yes, I know using CUtil is lame, but it made the comparison and timers easier.
Working global memory version:
#include <stdlib.h>
#include <stdio.h>
#include <cutil_inline.h>
#define N 1024
__global__ void kernel(int *d_A, int *d_B) {
unsigned int index_x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int index_y = blockIdx.y * blockDim.y + threadIdx.y;
// map the two 2D indices to a single linear, 1D index
unsigned int grid_width = gridDim.x * blockDim.x;
unsigned int index = index_y * grid_width + index_x;
// check for boundaries and write out the result
if((index_x > 0) && (index_y > 0) && (index_x < N-1) && (index_y < N-1))
d_B[index] = (d_A[index-1]+d_A[index+1]+d_A[index+N]+d_A[index-N])/4;
}
main (int argc, char **argv) {
int A[N][N], B[N][N];
int *d_A, *d_B; // These are the copies of A and B on the GPU
int *h_B; // This is a host copy of the output of B from the GPU
int i, j;
int num_bytes = N * N * sizeof(int);
// Input is randomly generated
for(i=0;i<N;i++) {
for(j=0;j<N;j++) {
A[i][j] = rand()/1795831;
//printf("%d\n",A[i][j]);
}
}
cudaEvent_t start_event0, stop_event0;
float elapsed_time0;
CUDA_SAFE_CALL( cudaEventCreate(&start_event0) );
CUDA_SAFE_CALL( cudaEventCreate(&stop_event0) );
cudaEventRecord(start_event0, 0);
// sequential implementation of main computation
for(i=1;i<N-1;i++) {
for(j=1;j<N-1;j++) {
B[i][j] = (A[i-1][j]+A[i+1][j]+A[i][j-1]+A[i][j+1])/4;
}
}
cudaEventRecord(stop_event0, 0);
cudaEventSynchronize(stop_event0);
CUDA_SAFE_CALL( cudaEventElapsedTime(&elapsed_time0,start_event0, stop_event0) );
h_B = (int *)malloc(num_bytes);
memset(h_B, 0, num_bytes);
//ALLOCATE MEMORY FOR GPU COPIES OF A AND B
cudaMalloc((void**)&d_A, num_bytes);
cudaMalloc((void**)&d_B, num_bytes);
cudaMemset(d_A, 0, num_bytes);
cudaMemset(d_B, 0, num_bytes);
//COPY A TO GPU
cudaMemcpy(d_A, A, num_bytes, cudaMemcpyHostToDevice);
// create CUDA event handles for timing purposes
cudaEvent_t start_event, stop_event;
float elapsed_time;
CUDA_SAFE_CALL( cudaEventCreate(&start_event) );
CUDA_SAFE_CALL( cudaEventCreate(&stop_event) );
cudaEventRecord(start_event, 0);
// TODO: CREATE BLOCKS AND THREADS AND INVOKE GPU KERNEL
dim3 block_size(256,1,1); //values experimentally determined to be fastest
dim3 grid_size;
grid_size.x = N / block_size.x;
grid_size.y = N / block_size.y;
kernel<<<grid_size,block_size>>>(d_A,d_B);
cudaEventRecord(stop_event, 0);
cudaEventSynchronize(stop_event);
CUDA_SAFE_CALL( cudaEventElapsedTime(&elapsed_time,start_event, stop_event) );
//COPY B BACK FROM GPU
cudaMemcpy(h_B, d_B, num_bytes, cudaMemcpyDeviceToHost);
// Verify result is correct
CUTBoolean res = cutComparei( (int *)B, (int *)h_B, N*N);
printf("Test %s\n",(1 == res)?"Passed":"Failed");
printf("Elapsed Time for Sequential: \t%.2f ms\n", elapsed_time0);
printf("Elapsed Time for CUDA:\t%.2f ms\n", elapsed_time);
printf("CUDA Speedup:\t%.2fx\n",(elapsed_time0/elapsed_time));
cudaFree(d_A);
cudaFree(d_B);
free(h_B);
cutilDeviceReset();
}
For the shared memory version, this is what I've tried so far:
#define N 1024
__global__ void kernel(int *d_A, int *d_B, int width) {
//assuming width is 64 because that's the biggest number I can make it
//each MP has 48KB of shared mem, which is 12K ints, 32 threads/warp, so max 375 ints/thread?
__shared__ int A_sh[3][66];
//get x and y index and turn it into linear index
for(i=0; i < width+2; i++) //have to load 2 extra values due to the -1 and +1 in algo
A_sh[index_y%3][i] = d_A[index+i-1]; //so A_sh[index_y%3][0] is actually d_A[index-1]
__syncthreads(); //and hope that previous and next row have been loaded by other threads in the block?
//ignore boundary conditions because it's pseudocode
for(i=0; i < width; i++)
d_B[index+i] = A_sh[index_y%3][i] + A_sh[index_y%3][i+2] + A_sh[index_y%3-1][i+1] + A_sh[index_y%3+1][i+1];
}
main(){
//same init as above until threads/grid init
dim3 threadsperblk(32,16);
dim3 numblks(32,64);
kernel<<<numblks,threadsperblk>>>(d_A,d_B,64);
//rest is the same
}
This shared mem code crashes ("launch failed due to unspecified error") since I haven't caught all the boundary conditions yet, but I'm not worried about that as much as finding the correct way to get things going. I feel that my code is way too complicated to be the correct path (especially compared to the SDK examples), but I also can't see another way to do it since my array doesn't fit into shared mem like all the examples I can find.
And frankly, I'm not sure it would be that much faster on my hardware (GTX 560 Ti - runs the global memory version in 0.121ms), but I need to prove it to myself first :P
Edit 2: For anyone who runs across this in the future, the code in the answer is a good starting point if you want to do some shared memory.
The key to getting the maximum out of these sort of stencil operators in CUDA is data re-usage. I have found that the best approach is usually to have each block "walk" through a dimension of the grid. After the block has loaded an initial tile of data into shared memory, only a single dimension (so row in a row-major order 2D problem ) needs to be read from global memory to have the necessary data in shared memory for the second and subsequent row calculations. The rest of the data can just be reused. To visualise how the shared memory buffer looks through the first four steps of this sort of algorithm:
1. Three "rows" (a,b,c) of the input grid are loaded to shared memory, and the stencil computed for row (b) and written to global memory
aaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbb
cccccccccccccccc
2. Another row (d) is loaded into the shared memory buffer, replacing row (a), and the calculations made for row (c) using a different stencil, reflecting where the row data is in shared memory
dddddddddddddddd
bbbbbbbbbbbbbbbb
cccccccccccccccc
3. Another row (e) is loaded into the shared memory buffer, replacing row (b), and the calculations made for row (d), using a different stencil from either step 1 or 2.
dddddddddddddddd
eeeeeeeeeeeeeeee
cccccccccccccccc
4. Another row (f) is loaded into the shared memory buffer, replacing row (c), and the calculations made for row (e). Now the data is back to the same layout as used in step 1, and the same stencil used in step 1 can be used.
dddddddddddddddd
eeeeeeeeeeeeeeee
ffffffffffffffff
The whole cycle repeats until the block has traversed the full column length of the input grid. The reason for using different stencils rather than shifting the data in the shared memory buffer is performance: shared memory only has about 1000 Gb/s bandwidth on Fermi, and shifting the data will become a bottleneck in fully optimal code. You should try different buffer sizes, because you might find that smaller buffers allow for higher occupancy and improved kernel throughput.
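The buffer rotation described above can be tried out on the host before moving it into a kernel. The following is a minimal C++ sketch, not the CUDA implementation: it keeps three rows in a small buffer, picks the "stencil" by computing the buffer slot as the input row number modulo 3 instead of shifting data, and checks the result against a direct computation. The function names stencilRolling and stencilDirect are made up for this example, and the stencil is the four-neighbour integer average from the question.

```cpp
#include <cassert>
#include <vector>

// Direct four-neighbour average over the interior of a row-major grid.
std::vector<int> stencilDirect(const std::vector<int> &in, int nrows, int ncols)
{
    std::vector<int> out(in.size(), 0);
    for (int r = 1; r < nrows - 1; ++r)
        for (int c = 1; c < ncols - 1; ++c)
            out[r * ncols + c] = (in[r * ncols + c - 1] + in[r * ncols + c + 1] +
                                  in[(r - 1) * ncols + c] + in[(r + 1) * ncols + c]) / 4;
    return out;
}

// Same result, but reading only from a rolling three-row buffer.
// Input row k always lives in buffer slot k % 3, so no data is shifted.
std::vector<int> stencilRolling(const std::vector<int> &in, int nrows, int ncols)
{
    std::vector<int> out(in.size(), 0);
    if (nrows < 3 || ncols < 3) return out;   // no interior points

    std::vector<std::vector<int>> buffer(3, std::vector<int>(ncols));
    for (int r = 0; r < 3; ++r)               // initial tile: rows 0, 1, 2
        for (int c = 0; c < ncols; ++c)
            buffer[r][c] = in[r * ncols + c];

    for (int row = 1; row < nrows - 1; ++row) {
        const std::vector<int> &above = buffer[(row - 1) % 3];
        const std::vector<int> &mid   = buffer[row % 3];
        const std::vector<int> &below = buffer[(row + 1) % 3];
        for (int c = 1; c < ncols - 1; ++c)
            out[row * ncols + c] = (mid[c - 1] + mid[c + 1] + above[c] + below[c]) / 4;

        // Load the next row over the oldest one (the slot that held row - 1)
        if (row + 2 < nrows)
            for (int c = 0; c < ncols; ++c)
                buffer[(row + 2) % 3][c] = in[(row + 2) * ncols + c];
    }
    return out;
}
```

In a kernel, the inner loops over c would be replaced by the threads of a block and the buffer by shared memory, with a synchronization before and after each row load.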
EDIT: To give a concrete example of how that might be implemented:
template<int width>
__device__ void rowfetch(int *in, int *out, int col)
{
*out = *in;
if (col == 1) *(out-1) = *(in-1);
if (col == width) *(out+1) = *(in+1);
}
template<int width>
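// renamed from "operator", which is a reserved word in C++; the new name is only illustrative
__global__ void stencil_operator(int *in, int *out, int nrows, unsigned int lda)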
{
// shared buffer holds three rows x (width+2) cols(threads)
__shared__ volatile int buffer [3][2+width];
int colid = threadIdx.x + blockIdx.x * blockDim.x;
int tid = threadIdx.x + 1;
int * rowpos = &in[colid], * outpos = &out[colid];
// load the first three rows (compiler will unroll loop)
for(int i=0; i<3; i++, rowpos+=lda) {
rowfetch<width>(rowpos, &buffer[i][tid], tid);
}
__syncthreads(); // shared memory loaded and all threads ready
int brow = 0; // brow is the next buffer row to load data onto
for(int i=0; i<nrows; i++, rowpos+=lda, outpos+=lda) {
// Do stencil calculations - use the value of brow to determine which
// stencil to use
result = ();
// write result to outpos
*outpos = result;
// Fetch another row
__syncthreads(); // Wait until all threads are done calculating
rowfetch<width>(rowpos, &buffer[brow][tid], tid);
brow = (brow < 2) ? (brow+1) : 0; // Increment or roll brow over
__syncthreads(); // Wait until all threads have updated the buffer
}
}