Modifying the basic vecAdd example to use shared memory - CUDA

I wrote the following kernel to use shared memory in the basic CUDA vecAdd example (sum of two vectors). The code works, but the elapsed time for the kernel execution is the same as for the original code. Can someone suggest a way to easily speed up this code?
__global__ void vecAdd(float *in1, float *in2, float *out, long int len)
{
    __shared__ float s_in1[THREADS_PER_BLOCK];
    __shared__ float s_in2[THREADS_PER_BLOCK];

    unsigned int xIndex = blockIdx.x * THREADS_PER_BLOCK + threadIdx.x;

    s_in1[threadIdx.x] = in1[xIndex];
    s_in2[threadIdx.x] = in2[xIndex];

    out[xIndex] = s_in1[threadIdx.x] + s_in2[threadIdx.x];
}

Can someone suggest a way to easily speed up this code?
There are basically no useful optimizations to make on an operation like vector addition. Because of the nature of the calculation, the code could only ever hope to reach 50% of peak arithmetic throughput, and the requirement for three memory transactions per FLOP (two loads and one store for each addition) makes this an intrinsically memory-bandwidth-bound operation.
As a result, this:
__global__ void vecAdd(float *in1, float *in2, float *out, unsigned int len)
{
    unsigned int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    if (xIndex < len) {
        float x = in1[xIndex];
        float y = in2[xIndex];
        out[xIndex] = x + y;
    }
}
is about the best-performing variant on most recent hardware, provided the block size is selected for maximum occupancy and len is sufficiently large. For example:
int minGrid, minBlockSize;
cudaOccupancyMaxPotentialBlockSize(&minGrid, &minBlockSize, vecAdd);
int nblocks = (len / minBlockSize) + ((len % minBlockSize > 0) ? 1 : 0);
vecAdd<<<nblocks, minBlockSize>>>(x, y, z, len);
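
One way to confirm that the kernel really is memory bandwidth bound rather than badly written is to time it with CUDA events and compare the effective bandwidth against the device's theoretical peak. A minimal sketch, reusing x, y, z, nblocks, minBlockSize and len from the launch above (includes and error checking omitted); the numbers you get are of course device dependent:

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
vecAdd<<<nblocks, minBlockSize>>>(x, y, z, len);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);

// Two 4-byte loads and one 4-byte store per element.
double gbytes = 3.0 * sizeof(float) * (double)len / 1.0e9;
printf("Effective bandwidth: %.1f GB/s\n", gbytes / (ms / 1000.0));

If that figure is close to the card's peak memory bandwidth, there is nothing left to optimize in the kernel itself.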

Related

Parallelizing a nested for loop with CUDA hits a limit

I am new to CUDA. I'm trying to write a CUDA kernel to perform the following piece of code.
for (int oz = 0; oz < count1; oz++)
{
    for (int ox = 0; ox < scale + 1; ox++)
    {
        for (int xhn = 0; xhn < Wjh; xhn++)
        {
            for (int yhn = 0; yhn < Wjv; yhn++)
            {
                //int numx = xhn + ox*Wjh;
                int numx = oz*(scale+1)*Wjh + ox*Wjh + xhn;
                int src2 = yhn + xhn*Wjv;
                Ic_real[src2] = Ic_real[src2] + Sr[oz*(scale+1)*Wjv + ox*Wjv + yhn]*Hr_table[numx] - Si[oz*(scale+1)*Wjv + ox*Wjv + yhn]*Hi_table[numx];
                Ic_img[src2]  = Ic_img[src2]  + Sr[oz*(scale+1)*Wjv + ox*Wjv + yhn]*Hi_table[numx] + Si[oz*(scale+1)*Wjv + ox*Wjv + yhn]*Hr_table[numx];
            }
        }
    }
}
The values are Wjh=1080, Wjv=1920, scale=255, oz>=4. This is what I have currently, but my code only works when count1<=4; if oz>4 it doesn't work. Does anyone know what I should do? Cheers.
__global__ void lut_kernel(float *Sr, float *Si, dim3 size, int Wjh, int Wjv, float *vr, float *vi,
                           float *hr, float *hi, float *Ic_re, float *Ic_im)
{
    __shared__ float cachere[threadPerblock];
    __shared__ float cacheim[threadPerblock];

    int blockId = blockIdx.x + blockIdx.y * gridDim.x;
    int cacheIndex = threadIdx.y * blockDim.x + threadIdx.x;
    int z = threadIdx.x;
    int x = threadIdx.y;
    int tid1 = threadIdx.y * blockDim.x + threadIdx.x;
    //int tid = blockId * (blockDim.x * blockDim.y)
    //        + (threadIdx.y * blockDim.x) + threadIdx.x;
    int countnum = 0;
    float re = 0.0f;
    float im = 0.0f;
    float re_value = 0.0f;
    float im_value = 0.0f;

    if (z < 4 && x < 256)
    {
        int src2 = z*(scale+1)*Wjh + x*Wjh + blockIdx.y;
        re = Sr[z*(scale+1)*Wjv + x*Wjv + blockIdx.x]*hr[src2] - Si[z*(scale+1)*Wjv + x*Wjv + blockIdx.x]*hi[src2];
        im = Sr[z*(scale+1)*Wjv + x*Wjv + blockIdx.x]*hi[src2] + Si[z*(scale+1)*Wjv + x*Wjv + blockIdx.x]*hr[src2];
    }
    cachere[cacheIndex] = re;
    cacheim[cacheIndex] = im;
    __syncthreads();

    int index = threadPerblock / 2;
    while (index != 0)
    {
        if (cacheIndex < index)
        {
            cachere[cacheIndex] += cachere[cacheIndex + index];
            cacheim[cacheIndex] += cacheim[cacheIndex + index];
        }
        index /= 2;
    }
    if (cacheIndex == 0)
    {
        Ic_re[blockId] = cachere[0];
        Ic_im[blockId] = cacheim[0];
        //printf("Ic= %d,blockId= %d\n", Ic_re[blockId], blockId);
    }
}
The kernel launch parameters are:
dim3 dimBlock(count1,256);
dim3 dimGrid(Wjv,Wjh);
lut_kernel<<<dimGrid,dimBlock>>>(d_Sr,d_Si,size,Wjh,Wjv,dvr_table,dvi_table,dhr_table,dhi_table,dIc_re,dIc_im);
If count1>4, what should I do to parallelize the nested loop code?
I checked the code briefly, and the computation of the Ic_img and Ic_real elements looks easy to parallelize (the count1, scale+1, Wjh and Wjv dimensions have no dependency on each other at all). There is therefore no need for shared variables and while loops in the kernel; it is straightforward to implement as below, with an extra parameter int numElements = count1 * (scale+1) * Wjh * Wjv.
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < numElements) {
    //....
}
The code will be significantly easier to maintain, and a short kernel is far less bug-prone than a long one like your example. If the src2 values never repeat within the innermost loop, the performance should be close to optimal as well. If src2 can repeat, accumulate with atomicAdd so that the results stay correct; with atomicAdd the performance may not be optimal, but at least you will have one correctly implemented, bug-free kernel. If it turns out to be a performance bottleneck, refine it afterwards by experimenting with different implementations.
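
As a minimal sketch of that approach (the way i is unflattened into the four loop indices, the parameter list, and the use of atomicAdd are my assumptions, since several (oz, ox) pairs map to the same src2 in the original loop nest):

__global__ void lut_flat_kernel(const float *Sr, const float *Si,
                                const float *Hr_table, const float *Hi_table,
                                float *Ic_real, float *Ic_img,
                                int scale, int Wjh, int Wjv,
                                long long numElements)
{
    // 64-bit index, since count1 * (scale+1) * Wjh * Wjv can exceed the int range.
    long long i = (long long)blockDim.x * blockIdx.x + threadIdx.x;
    if (i < numElements) {
        // Unflatten i back into the four loop indices (yhn fastest, oz slowest).
        int yhn = (int)(i % Wjv);
        int xhn = (int)((i / Wjv) % Wjh);
        int ox  = (int)((i / ((long long)Wjv * Wjh)) % (scale + 1));
        int oz  = (int)(i / ((long long)Wjv * Wjh * (scale + 1)));

        int numx = oz * (scale + 1) * Wjh + ox * Wjh + xhn;
        int srcS = oz * (scale + 1) * Wjv + ox * Wjv + yhn;
        int src2 = yhn + xhn * Wjv;

        // src2 depends only on (xhn, yhn), so different (oz, ox) pairs hit the same
        // output element; atomicAdd keeps the accumulation correct.
        atomicAdd(&Ic_real[src2], Sr[srcS] * Hr_table[numx] - Si[srcS] * Hi_table[numx]);
        atomicAdd(&Ic_img[src2],  Sr[srcS] * Hi_table[numx] + Si[srcS] * Hr_table[numx]);
    }
}

Launched with, for example, lut_flat_kernel<<<(numElements + 255) / 256, 256>>>(...) where numElements = count1 * (scale + 1) * Wjh * Wjv, assuming the device supports a large 1D grid (compute capability 3.0 or later); otherwise use a 2D grid or a grid-stride loop.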

Find the sum reduction issue with size of thread in CUDA

In a previous post here, I asked how to calculate the sum of an array with a reduction. Now I have a new problem: with a larger image, my result is not correct, and it changes every time I run it.
I tested with a 96*96 image-sized array sample:
First time result: 28169.046875
Second time result: 28169.048828
Expected result: 28169.031250
Here is my code:
#include <stdio.h>
#include <cuda.h>

__global__ void calculate_threshold_kernel(float * input, float * output)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x;
    __shared__ float partialSum[256];

    partialSum[t] = input[idx];
    __syncthreads();

    for (int stride = 1; stride < blockDim.x; stride *= 2)
    {
        if (t % (2 * stride) == 0)
            partialSum[t] += partialSum[t + stride];
        __syncthreads();
    }

    if (t == 0)
    {
        atomicAdd(output, partialSum[0]);
    }
}

int main( void )
{
    float *d_array, *d_output, *h_input, *h_output;
    int img_height = 96;
    int img_width = 96;
    int input_elements = img_height * img_width;

    h_input = (float*) malloc(sizeof(float) * input_elements);
    cudaMalloc((void**)&d_output, sizeof(float));
    cudaMemset(d_output, 0, sizeof(float));
    h_output = (float*)malloc(sizeof(float));
    cudaMalloc((void**)&d_array, input_elements*sizeof(float));

    float array[] = {[array sample]};
    for (int i = 0; i < input_elements; i++)
    {
        h_input[i] = array[i];
    }
    cudaMemcpy(d_array, h_input, input_elements*sizeof(float), cudaMemcpyHostToDevice);

    dim3 blocksize(256);
    dim3 gridsize(input_elements/blocksize.x);

    calculate_threshold_kernel<<<gridsize,blocksize>>>(d_array, d_output);
    cudaMemcpy(h_output, d_output, sizeof(float), cudaMemcpyDeviceToHost);
    printf("Sum from GPU = %f\n", *h_output);

    return 0;
}
While the answer from Kangshiyin is correct about floating point accuracy and floating point arithmetic being non-associative, he is not correct about the reason the results differ from one run to the other.
Floating point addition is non-associative, which means that operations performed in a different order can return different results. For example, (((a+b)+c)+d) may be slightly different from ((a+b)+(c+d)) for certain values of a, b, c and d. But neither of these results should vary from run to run.
Your results vary between runs because atomicAdd makes the order of the additions different each time. Using double also does not guarantee that the results will be the same between runs.
There are ways to implement a parallel reduction without atomicAdd as the final step (e.g. use a second kernel launch to add the partial sums from the first launch), which can provide consistent (yet slightly different from the CPU) results.
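
A minimal sketch of that two-pass approach, using hypothetical kernel names and assuming (as in the question) that the input length is an exact multiple of the 256-thread block size and that the first pass produces at most 256 partial sums:

// First pass: each block writes its partial sum to block_sums[blockIdx.x]
// instead of doing an atomicAdd into a single output.
__global__ void partial_sum_kernel(const float *input, float *block_sums)
{
    __shared__ float partialSum[256];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x;

    partialSum[t] = input[idx];
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
        __syncthreads();
    }

    if (t == 0)
        block_sums[blockIdx.x] = partialSum[0];
}

// Second pass: a single block reduces the per-block sums in a fixed order,
// so the result is identical from run to run. n is the number of blocks
// launched in the first pass (assumed <= 256 here).
__global__ void final_sum_kernel(const float *block_sums, float *output, int n)
{
    __shared__ float partialSum[256];
    int t = threadIdx.x;

    partialSum[t] = (t < n) ? block_sums[t] : 0.0f;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
        __syncthreads();
    }

    if (t == 0)
        *output = partialSum[0];
}

For the 96*96 case above this would be partial_sum_kernel<<<36, 256>>>(d_array, d_block_sums); followed by final_sum_kernel<<<1, 256>>>(d_block_sums, d_output, 36);, where d_block_sums is an extra device array of 36 floats.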
float has a limited precision of roughly 7 decimal digits, as explained here:
https://en.wikipedia.org/wiki/Floating_point#Accuracy_problems
The result changes because operations on float are non-associative and you are using a parallel reduction.
The result changes because operations on float are non-associative and you are using atomicAdd(), which cannot keep the order of the additions fixed.
You could use double instead if you want a more accurate result.

Only half of the shared memory array is assigned

I see that only half of the shared memory array is assigned when I step past s_f[sidx] = 5; in Nsight.
__global__ void BackProjectPixel(double* val,
                                 double* projection,
                                 double* focalPtPos,
                                 double* pxlPos,
                                 double* pxlGrid,
                                 double* detPos,
                                 double *detGridPos,
                                 unsigned int nN,
                                 unsigned int nS,
                                 double perModDetAngle,
                                 double perModSpaceAngle,
                                 double perModAngle)
{
    const double fx = focalPtPos[0];
    const double fy = focalPtPos[1];

    //extern __shared__ double s_f[64]; //
    __shared__ double s_f[64]; //

    unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;
    unsigned int j = (blockIdx.y * blockDim.y) + threadIdx.y;
    unsigned int idx = j*nN + i;
    unsigned int sidx = threadIdx.y * blockDim.x + threadIdx.x;
    unsigned int threadsPerSharedMem = 64;

    if (sidx < threadsPerSharedMem)
    {
        s_f[sidx] = 5;
    }
    __syncthreads();

    //double * angle;
    //
    if (sidx < threadsPerSharedMem)
    {
        s_f[idx] = TriPointAngle(detGridPos[0], detGridPos[1], fx, fy, pxlPos[idx*2], pxlPos[idx*2+1], nN);
    }
}
Here is what I observed
I am wondering why there are only thirty-two 5s. Shouldn't there be sixty-four 5s in s_f? Thanks.
Threads are executed in groups of (usually 32) threads, also called warps. Warps group the threads in order: in your case one warp gets threads 0-31, the other threads 32-63. In your debugging context, you are probably seeing the results of only the warp that contains threads 0-31.
I am wondering why there are only thirty-two 5s?
There are 32 fives because, as mete says, kernels are executed by groups of 32 threads at a time, so-called warps in CUDA terminology.
Shouldn't there be sixty-four 5s in s_f?
There will be 64 fives after the synchronization barrier, i.e. __syncthreads(). So if you place your breakpoint on the first instruction after the __syncthreads() call, you'll see all the fives. That's because by that time all the warps in the block will have finished executing all the code prior to __syncthreads().
How can I see all warps with Nsight?
You can easily see the values for all the threads by putting this into the watch field:
s_f[sidx]
Although the sidx value may become undefined due to optimizations, so it might be better to watch the value of:
s_f[((blockIdx.y * blockDim.y) + threadIdx.y) * nN + (blockIdx.x * blockDim.x) + threadIdx.x]
And indeed, if you want to investigate the values for a particular warp, then as Robert Crovella points out, you should use conditional breakpoints. If you want to break within the second warp, then something like this should work for a two-dimensional grid of two-dimensional blocks (which I presume you are using):
((blockIdx.x + blockIdx.y * gridDim.x) * (blockDim.x * blockDim.y) + (threadIdx.y * blockDim.x) + threadIdx.x) == 32
Because 32 is the index of the first thread within the second warp. For other combinations of block and grid dimensions see this useful cheatsheet.

Can multiple blocks and threads write to the same output?

I have the following CUDA kernel code which computes the sum squared error of two arrays.
__global__ void kSquaredError(double* data, double* recon, double* error,
                              unsigned int num_elements)
{
    const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    for (unsigned int i = idx; i < num_elements; i += blockDim.x * gridDim.x) {
        *error += pow(data[i] - recon[i], 2);
    }
}
I need a single scalar output (error). In this case, it seems like all threads are writing to error simultaneously. Is there some way I need to synchronize it?
Currently I'm getting a bad result so I'm guessing there is some issue.
The implementation you have now is subject to race conditions, because all threads try to update the same global memory address at the same time. You could easily put an atomicAdd in place of *error += pow..., but that suffers from performance issues because every update is serialized.
Instead you should try to do a reduction using shared memory, as follows:
__global__ void kSquaredError(double* data, double* recon, double* error, unsigned int num_elements) {
    const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int tid = threadIdx.x;

    // Temporary storage for each thread's error. Sized at launch time via the
    // third <<<...>>> parameter (blockDim.x * sizeof(double)), since a __shared__
    // array cannot be declared with the runtime value blockDim.x.
    extern __shared__ double serror[];

    serror[tid] = 0.0;
    for (unsigned int i = idx; i < num_elements; i += blockDim.x * gridDim.x) {
        serror[tid] += pow(data[i] - recon[i], 2); // accumulate this thread's error in shared memory
    }
    __syncthreads();

    int i = blockDim.x >> 1;   // halve the active threads
    for (; i > 0; i >>= 1) {   // reduction in shared memory
        if (tid < i) {
            serror[tid] += serror[tid + i];
        }
        __syncthreads(); // outside the if, so every thread reaches the barrier
    }

    if (tid == 0) { // thread 0 updates the value in global memory
        atomicAdd(error, serror[tid]); // same as *error += serror[tid]; but atomic
                                       // (atomicAdd on double needs compute capability 6.0+)
    }
}
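
Since serror is declared extern, the launch has to pass the shared memory size as the third launch parameter. A hedged example of what that might look like (the 256-thread block size and the d_data, d_recon, d_error names are just illustrative):

unsigned int blockSize = 256;
unsigned int gridSize  = (num_elements + blockSize - 1) / blockSize;

cudaMemset(d_error, 0, sizeof(double));   // the kernel accumulates into *error
kSquaredError<<<gridSize, blockSize, blockSize * sizeof(double)>>>(d_data, d_recon, d_error, num_elements);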
It works on the following principle: each thread has its own slot in shared memory where it accumulates the error over all of its inputs; when they have finished, all threads converge at the __syncthreads instruction to ensure that all the data is complete.
Then half of the threads in the block each take one value from the corresponding thread in the other half and add it to their own; the active threads are halved again and again until only one thread (thread 0) is left holding the total sum.
Thread 0 then updates global memory with an atomicAdd to avoid race conditions with other blocks, if there are any.
If we just used the first example with atomicAdd on every assignment, there would be one serialized atomic operation per processed element (num_elements in total); with the reduction there are only gridDim.x atomic operations, which is far fewer.
See Optimizing Parallel Reduction in CUDA for further reading on how reduction using cuda works.
Edit
Added if statement in the reduction for loop, forgot that.

CUDA 2D Convolution kernel

I'm a beginner in CUDA and I'm trying to implement a Sobel Edge detection kernel.
I'm using this code for it but it doesn't work.
Can anyone tell me what is wrong with it? I just get some -1's and some really big values.
__global__ void EdgeDetect_Hor(int *gpu_Edge_Hor, int *gpu_P,
                               int *gpu_Hor, int W, int H)
{
    int X = threadIdx.x;
    int Y = threadIdx.y;
    int sum = 0;
    int k1, k2;
    int min1, min2;

    for (k1 = 0; k1 < 3; k1++)
        for (k2 = 0; k2 < 3; k2++)
            sum += gpu_Hor[k1*3 + k2] * gpu_P[(X - k1)*H + Y - k2];

    gpu_Edge_Hor[X*H + Y] = sum / 5000;
}
I call this kernel like this:
dim3 dimBlock(W,H);
dim3 dimGrid(1,1);
EdgeDetect_Hor<<<dimGrid, dimBlock>>>(gpu_Edge_Hor, gpu_P, gpu_Hor, W, H);
First, your problem is that you process an image of 480x720 pixels. CUDA supports a maximum thread block size of 1024 threads for compute capability 2.0 and greater (512 for earlier devices), so you cannot launch that many threads in one block. The line dim3 dimBlock(W,H); is incorrect; you should divide your threads across several blocks.
Another problem is that CUDA processes data in row-major order, so you should change your memory access pattern.
Right memory access pattern for 2D arrays in CUDA is
BaseAddress + width * Y + X
where
unsigned int X = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int Y = blockIdx.y * blockDim.y + threadIdx.y;
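
A minimal sketch of how the kernel and launch might look once the work is split over 2D blocks with row-major indexing; the 16x16 block size, the bounds check, and skipping the one-pixel border that the 3x3 stencil cannot cover are my assumptions, not part of the original answer:

__global__ void EdgeDetect_Hor(int *gpu_Edge_Hor, const int *gpu_P,
                               const int *gpu_Hor, int W, int H)
{
    int X = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int Y = blockIdx.y * blockDim.y + threadIdx.y;   // row

    // Skip out-of-range threads and the border pixels the 3x3 stencil would read past.
    if (X < 1 || Y < 1 || X >= W - 1 || Y >= H - 1)
        return;

    int sum = 0;
    for (int k1 = 0; k1 < 3; k1++)
        for (int k2 = 0; k2 < 3; k2++)
            sum += gpu_Hor[k1 * 3 + k2] * gpu_P[(Y - k1) * W + (X - k2)];

    gpu_Edge_Hor[Y * W + X] = sum / 5000;
}

called with a grid that covers the whole image:

dim3 dimBlock(16, 16);
dim3 dimGrid((W + dimBlock.x - 1) / dimBlock.x,
             (H + dimBlock.y - 1) / dimBlock.y);
EdgeDetect_Hor<<<dimGrid, dimBlock>>>(gpu_Edge_Hor, gpu_P, gpu_Hor, W, H);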