Read cudaArray in device code - cuda

Is there a way to read the values in a cudaArray from the device without wrapping it in a texture reference/object? All of the examples I've looked at use cudaArrays exclusively for creating textures. Is that the only way they can be used, or could I do something like:
__global__ kernel(cudaArray *arr, ...) {
float x = tex1D<float>(arr, ...);
...
}
cudaArray *arr;
cudaMallocArray(&arr, ...);
cudaMemcpyToArray(arr, ...);
kernel<<<...>>>(arr, ...);
So basically, what should go in place of tex1D there? Also, if this is possible I'd be curious if anyone thinks there would be any performance benefit to doing this, but I'll also be running my own tests to see.
Thanks!

cudaArray is defined for texturing or surface memory purposes. As indicated here:
CUDA arrays are opaque memory layouts optimized for texture fetching.
They are one dimensional, two dimensional, or three-dimensional and
composed of elements, each of which has 1, 2 or 4 components that may
be signed or unsigned 8 , 16 or 32 bit integers, 16 bit floats, or 32
bit floats. CUDA arrays are only accessible by kernels through texture
fetching as described in Texture Memory or surface reading and writing
as described in Surface Memory.
So in effect you have to use either texture functions or surface functions in kernels to access data in a cudaArray.
There are several performance benefit possibilities associated with using texturing. Texturing can imply interpolation (i.e. reading from a texture using floating point coordinates). Any application that needs this kind of data interpolation may benefit from the HW interpolation engines inside the texture units on the GPU.
Another benefit, perhaps the most important for using texturing in arbitrary GPU codes, is the texture cache that backs up the textures stored in global memory. Texturing is a read-only operation, but if you have an array of read-only data, the texture cache may improve or otherwise extend your ability to access data rapidly. This generally implies that there must be data-locality/ data-reuse in your functions that are accessing data stored in the texturing mechanism. Texture data retrieved will not disrupt anything in the L1 cache, so generally this kind of data segmentation/optimization would be part of a larger strategy around data caching. If there were no other demands on L1 cache, the texture mechanism/cache does not provide faster access to data than if it were in the L1 already.

Robert Crovella has already answered to your question. I believe it could be useful for next users to have a worked example for the two solutions: textures and sufaces.
#include <stdio.h>
#include <thrust\device_vector.h>
// --- 2D float texture
texture<float, cudaTextureType2D, cudaReadModeElementType> texRef;
// --- 2D surface memory
surface<void, 2> surf2D;
/********************/
/* CUDA ERROR CHECK */
/********************/
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
/*************************************/
/* cudaArray PRINTOUT TEXTURE KERNEL */
/*************************************/
__global__ void cudaArrayPrintoutTexture(int width, int height)
{
unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
printf("Thread index: (%i, %i); cudaArray = %f\n", x, y, tex2D(texRef, x / (float)width + 0.5f, y / (float)height + 0.5f));
}
/*************************************/
/* cudaArray PRINTOUT TEXTURE KERNEL */
/*************************************/
__global__ void cudaArrayPrintoutSurface(int width, int height)
{
unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
float temp;
surf2Dread(&temp, surf2D, x * 4, y);
printf("Thread index: (%i, %i); cudaArray = %f\n", x, y, temp);
}
/********/
/* MAIN */
/********/
void main()
{
int width = 3, height = 3;
thrust::host_vector<float> h_data(width*height, 3.f);
// --- Allocate CUDA array in device memory
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
cudaArray* cuArray;
/*******************/
/* TEXTURE BINDING */
/*******************/
gpuErrchk(cudaMallocArray(&cuArray, &channelDesc, width, height));
// --- Copy to host data to device memory
gpuErrchk(cudaMemcpyToArray(cuArray, 0, 0, thrust::raw_pointer_cast(h_data.data()), width*height*sizeof(float), cudaMemcpyHostToDevice));
// --- Set texture parameters
texRef.addressMode[0] = cudaAddressModeWrap;
texRef.addressMode[1] = cudaAddressModeWrap;
texRef.filterMode = cudaFilterModeLinear;
texRef.normalized = true;
// --- Bind the array to the texture reference
gpuErrchk(cudaBindTextureToArray(texRef, cuArray, channelDesc));
// --- Invoking printout kernel
dim3 dimBlock(3, 3);
dim3 dimGrid(1, 1);
cudaArrayPrintoutTexture<<<dimGrid, dimBlock>>>(width, height);
gpuErrchk(cudaUnbindTexture(texRef));
gpuErrchk(cudaFreeArray(cuArray));
/******************/
/* SURFACE MEMORY */
/******************/
gpuErrchk(cudaMallocArray(&cuArray, &channelDesc, width, height, cudaArraySurfaceLoadStore));
// --- Copy to host data to device memory
gpuErrchk(cudaMemcpyToArray(cuArray, 0, 0, thrust::raw_pointer_cast(h_data.data()), width*height*sizeof(float), cudaMemcpyHostToDevice));
gpuErrchk(cudaBindSurfaceToArray(surf2D, cuArray));
cudaArrayPrintoutSurface<<<dimGrid, dimBlock>>>(width, height);
gpuErrchk(cudaPeekAtLastError());
gpuErrchk(cudaDeviceSynchronize());
gpuErrchk(cudaFreeArray(cuArray));
}

Related

Smart design for large kernel with different inputs that only changes one line of code

I am designing some kernels that I would like to have 2 ways of calling: Once with standard float * device as input (for writing), and another with cudaSurfaceObject_t as input (for writing). The kernel itself is long (>200 lines) and ultimately, I only need the last line to be different. In one case you have standard out[idx]=val type of assignment, while in the other one a surf3Dwrite() type. The rest of the kernel is identical.
Something like
__global__ kernel(float * out , ....)
{
// 200 lines of math
// only difference, aside from input argument
idx=....
out[idx]=a;
}
vs
__global__ kernel(cudaSurfaceObject_t * out, ...)
{
// 200 lines of math
// only difference, aside from input argument
surf3Dwrite(&out,val,x,y,z);
}
What is the smart way of coding this, without copy pasting the entire kernel and renaming it? I checked Templating, but (if I am not wrong) its for types only, one can not just have a completely different line of code when the type is different in a template. CUDA kernels don't seem to be able to be overloaded either.
CUDA kernels don't seem to be able to be overloaded either.
It should be possible to overload kernels. Here is one possible approach, using overloading (and no templating):
$ cat t1648.cu
// Includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>
#include <helper_cuda.h>
__device__ float my_common(float *d, int width, unsigned int x, unsigned int y){
// 200 lines of common code...
return d[y *width +x];
}
////////////////////////////////////////////////////////////////////////////////
// Kernels
////////////////////////////////////////////////////////////////////////////////
//! Write to a cuArray using surface writes
//! #param gIData input data in global memory
////////////////////////////////////////////////////////////////////////////////
__global__ void WriteKernel(float *gIData, int width, int height,
cudaSurfaceObject_t outputSurface)
{
// calculate surface coordinates
unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
// read from global memory and write to cuarray (via surface reference)
surf2Dwrite(my_common(gIData, width, x, y),
outputSurface, x*4, y, cudaBoundaryModeTrap);
}
__global__ void WriteKernel(float *gIData, int width, int height,
float *out)
{
// calculate coordinates
unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
// read from global memory and write to global memory
out[y*width+x] = my_common(gIData, width, x, y);
}
////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int main(int argc, char **argv)
{
printf("starting...\n");
unsigned width = 256;
unsigned height = 256;
unsigned int size = width * height * sizeof(float);
// Allocate device memory for result
float *dData = NULL;
checkCudaErrors(cudaMalloc((void **) &dData, size));
// Allocate array and copy image data
cudaChannelFormatDesc channelDesc =
cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
cudaArray *cuArray;
float *out;
cudaMalloc(&out, size);
checkCudaErrors(cudaMallocArray(&cuArray,
&channelDesc,
width,
height,
cudaArraySurfaceLoadStore));
dim3 dimBlock(8, 8, 1);
dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);
cudaSurfaceObject_t outputSurface;
cudaResourceDesc surfRes;
memset(&surfRes, 0, sizeof(cudaResourceDesc));
surfRes.resType = cudaResourceTypeArray;
surfRes.res.array.array = cuArray;
checkCudaErrors(cudaCreateSurfaceObject(&outputSurface, &surfRes));
WriteKernel<<<dimGrid, dimBlock>>>(dData, width, height, outputSurface);
WriteKernel<<<dimGrid, dimBlock>>>(dData, width, height, out);
checkCudaErrors(cudaDestroySurfaceObject(outputSurface));
checkCudaErrors(cudaFree(dData));
checkCudaErrors(cudaFreeArray(cuArray));
}
$ nvcc -I/usr/local/cuda/samples/common/inc t1648.cu -o t1648
$
The above example was hacked together rapidly from the simpleSurfaceWrite CUDA sample code. It is not intended to be functional or run "correctly". It is designed to show how overloading can be used from a code structure standpoint to address the stated objective.

Multiplication of two cudaArray in kernel?(using texture memory)

I have two cudaArray, a1 and a2 (which have the same size) which reprensent two matrices .
Using texture memory, I want to multiplicate those two cudaArrays .
Then I want to copy back the result in one normal arrays,let's name it *a1_h.
The fact is, I just don't know how to do it . I've managed to define, allocate my two cudaArrays and to put floats into them .
Now I want to do a kernel which does those multiplications .
Can somebody help me ?
ROOM_X and ROOM_Y are int, they define width and height of matrices .
mytex_M1 and mytex_M2 are texture defined as : texture < float,2,cudaReadModeElementType > .
Here is my main :
int main(int argc, char * argv[]) {
int size = ROOM_X * ROOM_Y * sizeof(float);
//creation of arrays on host.Will be useful for filling the cudaArrays
float *M1_h, *M2_h;
//allocating memories on Host
M1_h = (float *)malloc(size);
M2_h = (float *)malloc(size);
//creation of channel descriptions for 2d texture
cudaChannelFormatDesc channelDesc_M1 = cudaCreateChannelDesc<float>();
cudaChannelFormatDesc channelDesc_M2 = cudaCreateChannelDesc<float>();
//creation of 2 cudaArray * .
cudaArray *M1_array,*M2_array;
//bind arrays and channel in order to allocate space
cudaMallocArray(&M1_array,&channelDesc_M1,ROOM_X,ROOM_Y);
cudaMallocArray(&M2_array,&channelDesc_M2,ROOM_X,ROOM_Y);
//filling the matrices on host
Matrix(M1_h);
Matrix(M2_h);
//copy from host to device (putting the initial values of M1 and M2 into the arrays)
cudaMemcpyToArray(M1_array, 0, 0,M1_h, size,cudaMemcpyHostToDevice);
cudaMemcpyToArray(M2_array, 0, 0,M2_h, size,cudaMemcpyHostToDevice);
//set textures parameters
mytex_M1.addressMode[0] = cudaAddressModeWrap;
mytex_M1.addressMode[1] = cudaAddressModeWrap;
mytex_M1.filterMode = cudaFilterModeLinear;
mytex_M1.normalized = true; //NB coordinates in [0,1]
mytex_M2.addressMode[0] = cudaAddressModeWrap;
mytex_M2.addressMode[1] = cudaAddressModeWrap;
mytex_M2.filterMode = cudaFilterModeLinear;
mytex_M2.normalized = true; //NB coordinates in [0,1]
//bind arrays to the textures
cudaBindTextureToArray(mytex_M1,M1_array);
cudaBindTextureToArray(mytex_M2,M2_array);
//allocate device memory for result
float* M1_d;
cudaMalloc( (void**)&M1_d, size);
//dimensions of grid and blocks
dim3 dimGrid(ROOM_X,ROOM_Y);
dim3 dimBlock(1,1);
//execution of the kernel . The result of the multiplication has to be put in M1_d
mul_texture<<<dimGrid, dimBlock >>>(M1_d);
//copy result from device to host
cudaMemcpy(M1_h,M1_d, size, cudaMemcpyDeviceToHost);
//free memory on device
cudaFreeArray(M1_array);
cudaFreeArray(M2_array);
cudaFree(M1_d);
//free memory on host
free(M1_h);
free(M2_h);
return 0;
}
When you declare a texture
A texture reference can only be declared as a static global variable and cannot be passed as an argument to a function.
http://docs.nvidia.com/cuda/cuda-c-programming-guide/#texture-reference-api
So, if you have successfully define the texture references, initialize the arrays, copy then to the texture space and prepare the output buffers (something that seems to be done according to your code), what you need to do is implement the kernel. For example:
__global__ void
mul_texture(float* M1_d, int w, int h)
{
// map from threadIdx/BlockIdx to pixel position
int x = threadIdx.x + blockIdx.x * blockDim.x;
int y = threadIdx.y + blockIdx.y * blockDim.y;
// take care of the size of the image, it's a good practice
if ( x < w && y < h )
{
// the output M1_d is actually represented as 1D array
// so the offset of each value is related to their (x,y) position
// in a tow-major order
int gid = x + y * w;
// As texture are declared at global scope,
// we can access their content at any kernel
float M1_value = tex2D(mytex_M1,x,y);
float M2_value = tex2D(mytex_M2,x,y);
// The final results is the pointwise multiplication
M1_d[ gid ] = M1_value * M2_value;
}
}
You need to change the kernel invocation to include the w and h values, corresponding to the width (number of columns in the matrix) and height (number of rows of the matrix).
mul_texture<<<dimGrid, dimBlock >>>(M1_d, ROOM_X, ROOM_Y);
Note you are not doing any error checking, something that will help you quite a lot now and in the future. I have not checked if the kernel provided in this answer works as your code didnt compile.

CUDA: sum-reduction --- data lost in call to device function [duplicate]

I'm aware that there are multiple questions similar to this one already answered but I've been unable to piece together anything very helpful from them other than that I'm probably incorrectly indexing something.
I'm trying to preform a sequential addressing reduction on input vector A into output vector B.
The full code is available here http://pastebin.com/7UGadgjX, but this is the kernel:
__global__ void vectorSum(int *A, int *B, int numElements) {
extern __shared__ int S[];
// Each thread loads one element from global to shared memory
int tid = threadIdx.x;
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < numElements) {
S[tid] = A[i];
__syncthreads();
// Reduce in shared memory
for (int t = blockDim.x/2; t > 0; t>>=1) {
if (tid < t) {
S[tid] += S[tid + t];
}
__syncthreads();
}
if (tid == 0) B[blockIdx.x] = S[0];
}
}
and these are the kernel launch statements:
// Launch the Vector Summation CUDA Kernel
int threadsPerBlock = 256;
int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
vectorSum<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, numElements);
I'm getting a unspecified launch error which I've read is similar to a segfault. I've been following the nvidia reduction documentation closely and tried to keep my kernel within the bounds of numElements but I seem to be missing something key considering how simple the code is.
Your problem is that the reduction kernel requires dynamically allocated shared memory to operate correctly, but your kernel launch doesn't specify any. The result is out of bounds/illegal shared memory access which aborts the kernel.
In CUDA runtime API syntax, the kernel launch statement has four arguments. The first two are the grid and block dimensions for the launch. The latter two are optional with zero default values, but specify the dynamically allocated shared memory size and stream.
To fix this, change the launch code as follows:
// Launch the Vector Summation CUDA Kernel
int threadsPerBlock = 256;
int blocksPerGrid =(numElements + threadsPerBlock - 1) / threadsPerBlock;
size_t shmsz = (size_t)threadsPerBlock * sizeof(int);
vectorSum<<<blocksPerGrid, threadsPerBlock, shmsz>>>(d_A, d_B, numElements);
[disclaimer: code written in browser, not compiled or tested, use at own risk]
This should at least fix the most obvious problem with your code.

Bound CUDA texture reads zero

I try to read values from a texture and write them back to global memory.
I am sure the writing part works, beause I can put constant values in the kernel and I can see them in the output:
__global__ void
bartureKernel( float* g_odata, int width, int height)
{
unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
if(x < width && y < height) {
unsigned int idx = (y*width + x);
g_odata[idx] = tex2D(texGrad, (float)x, (float)y).x;
}
}
The texture I want to use is a 2D float texture with two channels, so I defined it as:
texture<float2, 2, cudaReadModeElementType> texGrad;
And the code which calls the kernel initializes the texture with some constant non-zero values:
float* d_data_grad = NULL;
cudaMalloc((void**) &d_data_grad, gradientSize * sizeof(float));
CHECK_CUDA_ERROR;
texGrad.addressMode[0] = cudaAddressModeClamp;
texGrad.addressMode[1] = cudaAddressModeClamp;
texGrad.filterMode = cudaFilterModeLinear;
texGrad.normalized = false;
cudaMemset(d_data_grad, 50, gradientSize * sizeof(float));
CHECK_CUDA_ERROR;
cudaBindTexture(NULL, texGrad, d_data_grad, cudaCreateChannelDesc<float2>(), gradientSize * sizeof(float));
float* d_data_barture = NULL;
cudaMalloc((void**) &d_data_barture, outputSize * sizeof(float));
CHECK_CUDA_ERROR;
dim3 dimBlock(8, 8, 1);
dim3 dimGrid( ((width-1) / dimBlock.x)+1, ((height-1) / dimBlock.y)+1, 1);
bartureKernel<<< dimGrid, dimBlock, 0 >>>( d_data_barture, width, height);
I know, setting the texture bytes to all "50" doesn't make much sense in the context of floats, but it should at least give me some non-zero values to read.
I can only read zeros though...
You are using cudaBindTexture to bind your texture to the memory allocated by cudaMalloc. In the kernel you are using tex2D function to read values from the texture. That is why it is reading zeros.
If you bind texture to linear memory using cudaBindTexture, it is read using tex1Dfetch inside the kernel.
tex2D is used to read only from those textures which are bound to pitch linear memory ( which is allocated by cudaMallocPitch ) using the function cudaBindTexture2D, or those textures which are bound to cudaArray using the function cudaBindTextureToArray
Here is the basic table, rest you can read from the programming guide:
Memory Type----------------- Allocated Using-----------------Bound Using-----------------------Read In The Kernel By
Linear Memory...................cudaMalloc........................cudaBindTexture.............................tex1Dfetch
Pitch Linear Memory.........cudaMallocPitch.............cudaBindTexture2D........................tex2D
cudaArray............................cudaMallocArray.............cudaBindTextureToArray.............tex1D or tex2D
3D cudaArray......................cudaMalloc3DArray........cudaBindTextureToArray.............tex3D
To add on, access using tex1Dfetch is based on integer indexing.
However, the rest are indexed based on floating point, and you have to add +0.5 to get the exact value you want.
I'm curious why do you create float and bind to a float2 texture? It may gives ambiguous results.
float2 is not 2D float texture. It can actually be used for representation of complex number.
typedef struct {float x; float y;} float2;
I think this tutorial will help you understand how to use texture memory in cuda.
http://www.drdobbs.com/parallel/cuda-supercomputing-for-the-masses-part/218100902
The kernel you shown does not benefit much from using texture. however, if utilized properly, by exploiting locality, texture memory can improve the performance by quite a lot. Also, it is useful for interpolation.

CUDA: Shared memory over a large-ish 2D array

I had a simple CUDA problem for a class assignment, but the professor added an optional task to implement the same algorithm using shared memory instead. I was unable to finish it before the deadline (as in, the turn-in date was a week ago) but I'm still curious so now I'm going to ask the internet ;).
The basic assignment was to implement a bastardized version of a red-black successive over-relaxation both sequentially and in CUDA, make sure you got the same result in both and then compare the speedup. Like I said, doing it with shared memory was an optional +10% add-on.
I'm going to post my working version and pseudocode what I've attempted to do since I don't have the code in my hands at the moment, but I can update this later with the actual code if someone needs it.
Before anyone says it: Yes, I know using CUtil is lame, but it made the comparison and timers easier.
Working global memory version:
#include <stdlib.h>
#include <stdio.h>
#include <cutil_inline.h>
#define N 1024
__global__ void kernel(int *d_A, int *d_B) {
unsigned int index_x = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int index_y = blockIdx.y * blockDim.y + threadIdx.y;
// map the two 2D indices to a single linear, 1D index
unsigned int grid_width = gridDim.x * blockDim.x;
unsigned int index = index_y * grid_width + index_x;
// check for boundaries and write out the result
if((index_x > 0) && (index_y > 0) && (index_x < N-1) && (index_y < N-1))
d_B[index] = (d_A[index-1]+d_A[index+1]+d_A[index+N]+d_A[index-N])/4;
}
main (int argc, char **argv) {
int A[N][N], B[N][N];
int *d_A, *d_B; // These are the copies of A and B on the GPU
int *h_B; // This is a host copy of the output of B from the GPU
int i, j;
int num_bytes = N * N * sizeof(int);
// Input is randomly generated
for(i=0;i<N;i++) {
for(j=0;j<N;j++) {
A[i][j] = rand()/1795831;
//printf("%d\n",A[i][j]);
}
}
cudaEvent_t start_event0, stop_event0;
float elapsed_time0;
CUDA_SAFE_CALL( cudaEventCreate(&start_event0) );
CUDA_SAFE_CALL( cudaEventCreate(&stop_event0) );
cudaEventRecord(start_event0, 0);
// sequential implementation of main computation
for(i=1;i<N-1;i++) {
for(j=1;j<N-1;j++) {
B[i][j] = (A[i-1][j]+A[i+1][j]+A[i][j-1]+A[i][j+1])/4;
}
}
cudaEventRecord(stop_event0, 0);
cudaEventSynchronize(stop_event0);
CUDA_SAFE_CALL( cudaEventElapsedTime(&elapsed_time0,start_event0, stop_event0) );
h_B = (int *)malloc(num_bytes);
memset(h_B, 0, num_bytes);
//ALLOCATE MEMORY FOR GPU COPIES OF A AND B
cudaMalloc((void**)&d_A, num_bytes);
cudaMalloc((void**)&d_B, num_bytes);
cudaMemset(d_A, 0, num_bytes);
cudaMemset(d_B, 0, num_bytes);
//COPY A TO GPU
cudaMemcpy(d_A, A, num_bytes, cudaMemcpyHostToDevice);
// create CUDA event handles for timing purposes
cudaEvent_t start_event, stop_event;
float elapsed_time;
CUDA_SAFE_CALL( cudaEventCreate(&start_event) );
CUDA_SAFE_CALL( cudaEventCreate(&stop_event) );
cudaEventRecord(start_event, 0);
// TODO: CREATE BLOCKS AND THREADS AND INVOKE GPU KERNEL
dim3 block_size(256,1,1); //values experimentally determined to be fastest
dim3 grid_size;
grid_size.x = N / block_size.x;
grid_size.y = N / block_size.y;
kernel<<<grid_size,block_size>>>(d_A,d_B);
cudaEventRecord(stop_event, 0);
cudaEventSynchronize(stop_event);
CUDA_SAFE_CALL( cudaEventElapsedTime(&elapsed_time,start_event, stop_event) );
//COPY B BACK FROM GPU
cudaMemcpy(h_B, d_B, num_bytes, cudaMemcpyDeviceToHost);
// Verify result is correct
CUTBoolean res = cutComparei( (int *)B, (int *)h_B, N*N);
printf("Test %s\n",(1 == res)?"Passed":"Failed");
printf("Elapsed Time for Sequential: \t%.2f ms\n", elapsed_time0);
printf("Elapsed Time for CUDA:\t%.2f ms\n", elapsed_time);
printf("CUDA Speedup:\t%.2fx\n",(elapsed_time0/elapsed_time));
cudaFree(d_A);
cudaFree(d_B);
free(h_B);
cutilDeviceReset();
}
For the shared memory version, this is what I've tried so far:
#define N 1024
__global__ void kernel(int *d_A, int *d_B, int width) {
//assuming width is 64 because that's the biggest number I can make it
//each MP has 48KB of shared mem, which is 12K ints, 32 threads/warp, so max 375 ints/thread?
__shared__ int A_sh[3][66];
//get x and y index and turn it into linear index
for(i=0; i < width+2; i++) //have to load 2 extra values due to the -1 and +1 in algo
A_sh[index_y%3][i] = d_A[index+i-1]; //so A_sh[index_y%3][0] is actually d_A[index-1]
__syncthreads(); //and hope that previous and next row have been loaded by other threads in the block?
//ignore boundary conditions because it's pseudocode
for(i=0; i < width; i++)
d_B[index+i] = A_sh[index_y%3][i] + A_sh[index_y%3][i+2] + A_sh[index_y%3-1][i+1] + A_sh[index_y%3+1][i+1];
}
main(){
//same init as above until threads/grid init
dim3 threadsperblk(32,16);
dim3 numblks(32,64);
kernel<<<numblks,threadsperblk>>>(d_A,d_B,64);
//rest is the same
}
This shared mem code crashes ("launch failed due to unspecified error") since I haven't caught all the boundary conditions yet, but I'm not worried about that as much as finding the correct way to get things going. I feel that my code is way too complicated to be the correct path (especially compared to the SDK examples), but I also can't see another way to do it since my array doesn't fit into shared mem like all the examples I can find.
And frankly, I'm not sure it would be that much faster on my hardware (GTX 560 Ti - runs the global memory version in 0.121ms), but I need to prove it to myself first :P
Edit 2: For anyone who runs across this in the future, the code in the answer is a good starting point if you want to do some shared memory.
The key to getting the maximum out of these sort of stencil operators in CUDA is data re-usage. I have found that the best approach is usually to have each block "walk" through a dimension of the grid. After the block has loaded an initial tile of data into shared memory, only a single dimension (so row in a row-major order 2D problem ) needs to be read from global memory to have the necessary data in shared memory for the second and subsequent row calculations. The rest of the data can just be reused. To visualise how the shared memory buffer looks through the first four steps of this sort of algorithm:
Three "rows" (a,b,c) of the input grid are loaded to shared memory, and the stencil computed for row (b) and written to global memory
aaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbb
cccccccccccccccc
Another row (d) is loaded into the shared memory buffer, replacing row (a), and the calculations made for row (c) using a different stencil, reflecting where the row data is in shared memory
dddddddddddddddd
bbbbbbbbbbbbbbbb
cccccccccccccccc
Another row (e) is loaded into the shared memory buffer, replacing row (b), and the calculations made for row (d), using a different stencil from either step 1 or 2.
dddddddddddddddd
eeeeeeeeeeeeeeee
cccccccccccccccc
Another row (f) is loaded into the shared memory buffer, replacing row (c), and the calculations made for row (e). Now the data is back to the same layout as used in step 1, and the same stencil used in step 1 can be used.
dddddddddddddddd
eeeeeeeeeeeeeeee
ffffffffffffffff
The whole cycle repeats until the block has traverse full column length of the input grid. The reason for using different stencils rather than shifting the data in the shared memory buffer is down to performance - shared memory only has about 1000 Gb/s bandwidth on Fermi, and the shifting of data will become a bottleneck in fully optimal code. You should try different buffer sizes, because you might find smaller buffers allows for higher occupancy and improved kernel throughput.
EDIT: To give a concrete example of how that might be implemented:
template<int width>
__device__ void rowfetch(int *in, int *out, int col)
{
*out = *in;
if (col == 1) *(out-1) = *(in-1);
if (col == width) *(out+1) = *(in+1);
}
template<int width>
__global__ operator(int *in, int *out, int nrows, unsigned int lda)
{
// shared buffer holds three rows x (width+2) cols(threads)
__shared__ volatile int buffer [3][2+width];
int colid = threadIdx.x + blockIdx.x * blockDim.x;
int tid = threadIdx.x + 1;
int * rowpos = &in[colid], * outpos = &out[colid];
// load the first three rows (compiler will unroll loop)
for(int i=0; i<3; i++, rowpos+=lda) {
rowfetch<width>(rowpos, &buffer[i][tid], tid);
}
__syncthreads(); // shared memory loaded and all threads ready
int brow = 0; // brow is the next buffer row to load data onto
for(int i=0; i<nrows; i++, rowpos+=lda, outpos+=lda) {
// Do stencil calculations - use the value of brow to determine which
// stencil to use
result = ();
// write result to outpos
*outpos = result;
// Fetch another row
__syncthreads(); // Wait until all threads are done calculating
rowfetch<width>(rowpos, &buffer[brow][tid], tid);
brow = (brow < 2) ? (brow+1) : 0; // Increment or roll brow over
__syncthreads(); // Wait until all threads have updated the buffer
}
}