CUDA 2D Convolution kernel - cuda

I'm a beginner in CUDA and I'm trying to implement a Sobel edge detection kernel.
I'm using the code below, but it doesn't work.
Can anyone tell me what is wrong with it? I just get some -1's and some really big values.
__global__ void EdgeDetect_Hor(int *gpu_Edge_Hor, int *gpu_P,
                               int *gpu_Hor, int W, int H)
{
    int X = threadIdx.x;
    int Y = threadIdx.y;
    int sum = 0;
    int k1, k2;
    int min1, min2;
    for (k1 = 0; k1 < 3; k1++)
        for(k2 = 0; k2 <3;k2++)
            sum += gpu_Hor[k1*3+k2]*gpu_P[(X-k1)*H+Y-k2];
    gpu_Edge_Hor[X*H+Y] = sum/5000;
}
I call this kernel like this:
dim3 dimBlock(W,H);
dim3 dimGrid(1,1);
EdgeDetect_Hor<<<dimGrid, dimBlock>>>(gpu_Edge_Hor, gpu_P, gpu_Hor, W, H);

First, you are processing an image of 480x720 pixels. A thread block can contain at most 1024 threads on devices of compute capability 2.0 and later, and 512 on earlier devices, so you can't run that many threads in a single block. The line dim3 dimBlock(W,H); is therefore incorrect. You should split your threads across several blocks.
Another problem is that 2D data in CUDA code is conventionally laid out in row-major order, so you should change your memory access pattern.
The correct memory access pattern for 2D arrays in CUDA is
BaseAddress + width * Y + X
where
unsigned int X = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int Y = blockIdx.y * blockDim.y + threadIdx.y;
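Putting both points together, a multi-block version might look roughly like the sketch below. It assumes a row-major image of W x H ints, a 3x3 mask centered on each pixel, and that border pixels are simply left untouched. The 16x16 block size, the bounds check, and the helper name launchEdgeDetect are illustrative choices, not part of the original code, and the division by 5000 from the question is omitted.

```cuda
#include <cuda_runtime.h>

// Sketch: 3x3 horizontal mask applied to a row-major W x H image.
// Border pixels are skipped (an assumption; the original code does not say).
__global__ void EdgeDetect_Hor(int *gpu_Edge_Hor, const int *gpu_P,
                               const int *gpu_Hor, int W, int H)
{
    int X = blockIdx.x * blockDim.x + threadIdx.x;  // column
    int Y = blockIdx.y * blockDim.y + threadIdx.y;  // row

    // Threads outside the interior do nothing, so no read goes out of bounds.
    if (X < 1 || Y < 1 || X >= W - 1 || Y >= H - 1)
        return;

    int sum = 0;
    for (int k1 = 0; k1 < 3; k1++)        // mask row
        for (int k2 = 0; k2 < 3; k2++)    // mask column
            sum += gpu_Hor[k1 * 3 + k2] * gpu_P[(Y + k1 - 1) * W + (X + k2 - 1)];

    gpu_Edge_Hor[Y * W + X] = sum;
}

// Hypothetical host helper: 16x16 threads per block, enough blocks to cover the image.
void launchEdgeDetect(int *gpu_Edge_Hor, const int *gpu_P,
                      const int *gpu_Hor, int W, int H)
{
    dim3 dimBlock(16, 16);
    dim3 dimGrid((W + dimBlock.x - 1) / dimBlock.x,
                 (H + dimBlock.y - 1) / dimBlock.y);
    EdgeDetect_Hor<<<dimGrid, dimBlock>>>(gpu_Edge_Hor, gpu_P, gpu_Hor, W, H);
}
```

Rounding the grid size up means some threads in the last blocks fall outside the image, which is why the kernel checks X and Y before touching memory.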

Related

Modifying the basic example VECADD to use the shared memory

I wrote the following kernel to use shared memory in the basic CUDA vecadd example (sum of two vectors). The code works, but the elapsed time for the kernel execution is the same as for the original code. Can someone suggest a way to easily speed up this code?
__global__ void vecAdd(float *in1, float *in2, float *out,long int len)
{
    __shared__ float s_in1[THREADS_PER_BLOCK];
    __shared__ float s_in2[THREADS_PER_BLOCK];
    unsigned int xIndex = blockIdx.x * THREADS_PER_BLOCK + threadIdx.x;
    s_in1[threadIdx.x]=in1[xIndex];
    s_in2[threadIdx.x]=in2[xIndex];
    out[xIndex]=s_in1[threadIdx.x]+s_in2[threadIdx.x];
}
Can someone suggest a way to easily speed up this code?
There are basically no useful optimizations to make on an operation like vector addition. Because of the nature of the calculation, the code could only ever hope to reach 50% peak arithmetic throughput, and the requirement for three memory transactions per FLOP makes this an intrinsically memory bandwidth bound operation.
As a result, this:
__global__ void vecAdd(float *in1, float *in2, float *out, unsigned int len)
{
    unsigned int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    if (xIndex < len) {
        float x = in1[xIndex];
        float y = in2[xIndex];
        out[xIndex] = x + y;
    }
}
is about the best performing variant on most recent hardware, provided the block size is selected for maximum occupancy and len is sufficiently large. For example:
int minGrid, minBlockSize;
cudaOccupancyMaxPotentialBlockSize(&minGrid, &minBlockSize, vecAdd);
int nblocks = (len / minBlockSize) + ((len % minBlockSize > 0) ? 1 : 0);
vecAdd<<<nblocks, minBlockSize>>>(x, y, z, len);

cuda kernel failed when block number smaller than max number

I have written a CUDA function that does the same as cublasSdgmm in CUBLAS, and I find that when I increase the number of blocks, performance may get worse, or the function may even fail.
Here is the code. M = 9.6e6, S = 3, the best-performing number of blocks is 320, my GPU is a GTX 960, and the maximum grid size is 2147483647 in the X dimension.
__global__ void DgmmKernel(float *d_y, float *d_r, int M, int S){
    int row = blockIdx.x*blockDim.x + threadIdx.x;
    int col = blockIdx.y*blockDim.y + threadIdx.y;
    while(row < M){
        d_y[row + col * M] *= d_r[row];
        row += blockDim.x * gridDim.x;
    }
}
void Dgmm(float *d_y, float *d_r, int M, int S){
    int xthreads_per_block = 1024;
    dim3 dimBlock(xthreads_per_block, 1);
    dim3 dimGrid(320, S);
    DgmmKernel<<<dimBlock, dimGrid>>>(d_y, d_r, M, S);
}
I guess the reason is that there may be a resource limit on the GPU. Is that right?
If so, which specific resource limits the performance? The kernel just reads two vectors and does a multiplication. And is there any way to improve performance on my GPU?
You have the block and grid dimension arguments reversed in your kernel launch, so the kernel is not running with the configuration you intended. You should do something like this:
dim3 dimBlock(xthreads_per_block, 1);
dim3 dimGrid(320, S);
DgmmKernel<<<dimGrid, dimBlock>>>(d_y, d_r, M, S);
If your code contained appropriate runtime error checking, you would already be aware that the kernel launch is failing with an invalid configuration error for any value of S>3.
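As a rough illustration of the kind of runtime error checking meant here, the sketch below checks the launch status with cudaGetLastError() and then the execution status with cudaDeviceSynchronize(). The kernel dummyKernel and the helper launchStatus are hypothetical names used only for this example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel, used only to exercise the launch check.
__global__ void dummyKernel() {}

// Launches dummyKernel with the given block size and returns the first error seen.
cudaError_t launchStatus(int threadsPerBlock)
{
    dummyKernel<<<1, threadsPerBlock>>>();

    // Launch-time errors (such as an invalid configuration) show up here.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return err;
    }

    // Errors raised while the kernel runs show up after synchronization.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    return err;
}
```

On a device limited to 1024 threads per block, launchStatus(2048) would be expected to report an invalid configuration error, while launchStatus(256) should return cudaSuccess.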

Cuda block/grid dimensions: when to use dim3?

I need some clarification regarding the use of dim3 to set the number of threads in my CUDA kernel.
I have an image in a 1D float array, which I'm copying to the device with:
checkCudaErrors(cudaMemcpy( img_d, img.data, img.row * img.col * sizeof(float), cudaMemcpyHostToDevice));
Now I need to set the grid and block sizes to launch my kernel:
dim3 blockDims(512);
dim3 gridDims((unsigned int) ceil(img.row * img.col * 3 / blockDims.x));
myKernel<<< gridDims, blockDims>>>(...)
I'm wondering: in this case, since the data is 1D, does it matter if I use a dim3 structure? Any benefits over using
unsigned int num_blocks = ceil(img.row * img.col * 3 / blockDims.x);
myKernel<<<num_blocks, 512>>>(...)
instead?
Also, is my understanding correct that when using dim3, I'll reference the thread ID with 2 indices inside my kernel:
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
And when I'm not using dim3, I'll just use one index?
Thank you very much,
The way you arrange the data in memory is independent of how you configure the threads of your kernel.
Memory is always a 1D contiguous space of bytes. However, the access pattern depends on how you interpret your data and on whether you access it with 1D, 2D or 3D blocks of threads.
dim3 is an integer vector type based on uint3 that is used to specify dimensions. When defining a variable of type dim3, any component left unspecified is initialized to 1.
The same happens for the blocks and the grid.
Read more at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#dim3
So, in both cases: dim3 blockDims(512); and myKernel<<<num_blocks, 512>>>(...) you will always have access to threadIdx.y and threadIdx.z.
Since thread IDs start at zero, you can still calculate a memory position in row-major order that includes the y dimension:
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int gid = img.col * y + x;
because blockIdx.y and threadIdx.y will be zero.
To sum up, it does not matter whether you use a dim3 structure. Using one does make it clear where the thread configuration is defined, and the 1D, 2D or 3D access pattern depends on how you interpret your data and on whether you access it with 1D, 2D or 3D blocks of threads.
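For the 1D case in the question, the launch and indexing might be sketched as follows. One detail to watch: with integer operands, ceil(a / b) truncates in the division before ceil is applied, so the sketch rounds up with integer arithmetic instead. The kernel body, the scale factor, and the helper names blocksFor and scaleImage are illustrative and not taken from the original code.

```cuda
#include <cuda_runtime.h>

// Number of blocks needed to cover n elements, rounded up with integer arithmetic.
inline unsigned int blocksFor(unsigned int n, unsigned int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Illustrative 1D kernel: one thread per array element.
__global__ void myKernel(float *img, unsigned int n, float factor)
{
    unsigned int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)              // the last block may be only partly used
        img[gid] *= factor;
}

// Hypothetical host helper; a plain integer and a dim3 launch the same grid.
void scaleImage(float *img_d, unsigned int n, float factor)
{
    unsigned int threads = 512;
    myKernel<<<blocksFor(n, threads), threads>>>(img_d, n, factor);
    // Equivalent: myKernel<<<dim3(blocksFor(n, threads)), dim3(threads)>>>(img_d, n, factor);
}
```

Either launch form gives a grid whose y and z extents are 1, so a single index is enough inside the kernel.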

Only half of the shared memory array is assigned

When I step past s_f[sidx] = 5; in Nsight, I see that only half of the shared memory array has been assigned.
__global__ void BackProjectPixel(double* val,
                                 double* projection,
                                 double* focalPtPos,
                                 double* pxlPos,
                                 double* pxlGrid,
                                 double* detPos,
                                 double *detGridPos,
                                 unsigned int nN,
                                 unsigned int nS,
                                 double perModDetAngle,
                                 double perModSpaceAngle,
                                 double perModAngle)
{
    const double fx = focalPtPos[0];
    const double fy = focalPtPos[1];
    //extern __shared__ double s_f[64]; //
    __shared__ double s_f[64]; //
    unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    unsigned int j = (blockIdx.y * blockDim.y) + threadIdx.y;
    unsigned int idx = j*nN + i;
    unsigned int sidx = threadIdx.y * blockDim.x + threadIdx.x;
    unsigned int threadsPerSharedMem = 64;
    if (sidx < threadsPerSharedMem)
    {
        s_f[sidx] = 5;
    }
    __syncthreads();
    //double * angle;
    //
    if (sidx < threadsPerSharedMem)
    {
        s_f[idx] = TriPointAngle(detGridPos[0], detGridPos[1],fx, fy, pxlPos[idx*2], pxlPos[idx*2+1], nN);
    }
}
Here is what I observed
I am wondering why there are only thirty-two 5? Shouldn't there be sixty-four 5 in s_f? Thanks.
Threads are executed in groups (of 32 threads on current hardware) called warps. Warps group the threads in order: in your case one warp gets threads 0-31 and the other gets threads 32-63. In your debugging context, you are probably seeing the results of only the warp that contains threads 0-31.
I am wondering why there are only thirty-two 5?
There are 32 fives because, as mete says, threads execute simultaneously only in groups of 32, called warps in CUDA terminology.
Shouldn't there be sixty-four 5 in s_f?
There will be 64 fives after the synchronization barrier, i.e. __syncthreads(). So if you place your breakpoint on the first instruction after the __syncthreads() call, you'll see all the fives. That's because by then all the warps in the block will have finished executing all the code prior to __syncthreads().
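A small self-contained sketch of this behavior, assuming a single block of 64 threads: every thread writes a 5, the block synchronizes, and thread 0 then counts the fives. The kernel fillAndCount and the helper countFives are hypothetical and not part of the original code.

```cuda
#include <cuda_runtime.h>

// Each of the 64 threads stores a 5; after the barrier thread 0 counts them.
__global__ void fillAndCount(int *count)
{
    __shared__ double s_f[64];
    unsigned int sidx = threadIdx.y * blockDim.x + threadIdx.x;

    if (sidx < 64)
        s_f[sidx] = 5;
    __syncthreads();   // both warps have finished their writes past this point

    if (sidx == 0) {
        int n = 0;
        for (int i = 0; i < 64; i++)
            if (s_f[i] == 5) n++;
        *count = n;
    }
}

// Hypothetical host helper: launches one 8x8 block and returns the count.
int countFives()
{
    int *d_count, h_count = -1;
    cudaMalloc((void **)&d_count, sizeof(int));
    fillAndCount<<<1, dim3(8, 8)>>>(d_count);
    cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_count);
    return h_count;
}
```

Because the count is taken after the barrier, it reflects the writes of both warps, not just the one the debugger happened to be stepping.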
How can I see all warps with Nsight?
You can easily see the values for all the threads by putting this into a watch field:
s_f[sidx]
However, the value of sidx may become undefined due to optimizations, so it is safer to watch the expanded expression:
s_f[threadIdx.y * blockDim.x + threadIdx.x]
And indeed, if you want to investigate the values for a particular warp, then as Robert Crovella points out, you should use conditional breakpoints. If you want to break within the second warp, something like this should work for a two-dimensional grid of two-dimensional blocks (which I presume you are using):
((blockIdx.x + blockIdx.y * gridDim.x) * (blockDim.x * blockDim.y) + (threadIdx.y * blockDim.x) + threadIdx.x) == 32
Because 32 is the index of the first thread within the second warp. For other combinations of block and grid dimensions see this useful cheatsheet.

Why does this CUDA example kernel have a for loop?

I have been looking at the following example from the official CUDA website:
http://docs.nvidia.com/cuda/cuda-samples/index.html#simple-cufft
Download here: http://developer.download.nvidia.com/compute/DevZone/C/Projects/x64/simpleCUFFT.zip
It contains the following kernel:
// Complex pointwise multiplication
static __global__ void ComplexPointwiseMulAndScale(Complex *a, const Complex *b, int size, float scale)
{
    const int numThreads = blockDim.x * gridDim.x;
    const int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = threadID; i < size; i += numThreads)
    {
        a[i] = ComplexScale(ComplexMul(a[i], b[i]), scale);
    }
}
My question is, why is there a for loop here? Doesn't CUDA run an array of threads simultaneously? I removed the loop, replacing it with the following code, and it produced the same output.
// Complex pointwise multiplication
static __global__ void ComplexPointwiseMulAndScale(Complex *a, const Complex *b, int size, float scale)
{
    const int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    a[threadID] = ComplexScale(ComplexMul(a[threadID], b[threadID]), scale);
}
As this is an official example on the CUDA website, I imagine I must be missing something.
Your version is basically what happens when numThreads is equal to size (but only then).
What the official example does is the following: Suppose numThreads is equal to 4 (for simplicity, usually it will be much larger), and consider the array positions (both for a and b):
a or b:                 x x x x x x x x
thread that works here: 0 1 2 3 0 1 2 3
Then thread 0 will work on all array positions divisible by 4, thread 1 on the positions that leave remainder 1, et cetera.
The problem with your version is that the caller of your function has to make sure that there are exactly as many threads as there are elements. For example, if you call your version with a 1-dim grid and both gridDim.x and blockDim.x being 2, but on vectors of length 8, then half of your vector isn't processed! Conversely, with more threads than elements, the extra threads would access memory out of bounds.
The official example works regardless - no matter how many threads the caller assigns to it, the entire vector will be processed.
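To make the difference concrete, here is a minimal sketch of the same grid-stride pattern on plain floats, launched with deliberately fewer threads than elements (2 blocks of 2 threads). The kernel scaleAll and the helper runScale are hypothetical names, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>

// Grid-stride loop: each thread handles elements threadID, threadID + numThreads, ...
__global__ void scaleAll(float *a, int size, float scale)
{
    const int numThreads = blockDim.x * gridDim.x;
    const int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = threadID; i < size; i += numThreads)
        a[i] *= scale;
}

// Hypothetical host helper: scales h_a in place using only 4 threads in total.
void runScale(float *h_a, int size, float scale)
{
    float *d_a;
    cudaMalloc((void **)&d_a, size * sizeof(float));
    cudaMemcpy(d_a, h_a, size * sizeof(float), cudaMemcpyHostToDevice);
    scaleAll<<<2, 2>>>(d_a, size, scale);   // 4 threads, any number of elements
    cudaMemcpy(h_a, d_a, size * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
}
```

With 8 elements, each of the 4 threads goes around the loop twice, so the whole array is still covered.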