parallelizing nested for loop with CUDA has a big limit - cuda

I am new to CUDA. I'm trying to write a CUDA kernel to perform the computation in the following piece of code.
for (int oz = 0; oz < count1; oz++)
{
    for (int ox = 0; ox < scale + 1; ox++)
    {
        for (int xhn = 0; xhn < Wjh; xhn++)
        {
            for (int yhn = 0; yhn < Wjv; yhn++)
            {
                //int numx=xhn+ox*Wjh;
                int numx = oz * (scale + 1) * Wjh + ox * Wjh + xhn;
                int src2 = yhn + xhn * Wjv;
                Ic_real[src2] = Ic_real[src2] + Sr[oz*(scale+1)*Wjv + ox*Wjv + yhn] * Hr_table[numx] - Si[oz*(scale+1)*Wjv + ox*Wjv + yhn] * Hi_table[numx];
                Ic_img[src2]  = Ic_img[src2]  + Sr[oz*(scale+1)*Wjv + ox*Wjv + yhn] * Hi_table[numx] + Si[oz*(scale+1)*Wjv + ox*Wjv + yhn] * Hr_table[numx];
            }
        }
    }
}
The values are Wjh=1080, Wjv=1920, scale=255, and count1 can be >= 4. This is what I have currently, but my code only works when count1<=4; if count1>4, it doesn't work. Does anyone know what I should do? Cheers
__global__ void lut_kernel(float *Sr, float *Si, dim3 size, int Wjh, int Wjv,
                           float *vr, float *vi, float *hr, float *hi,
                           float *Ic_re, float *Ic_im)
{
    __shared__ float cachere[threadPerblock];
    __shared__ float cacheim[threadPerblock];
    int blockId = blockIdx.x + blockIdx.y * gridDim.x;
    int cacheIndex = threadIdx.y * blockDim.x + threadIdx.x;
    int z = threadIdx.x;
    int x = threadIdx.y;
    int tid1 = threadIdx.y * blockDim.x + threadIdx.x;
    //int tid= blockId * (blockDim.x * blockDim.y)
    //       + (threadIdx.y * blockDim.x) + threadIdx.x;
    int countnum = 0;
    float re = 0.0f;
    float im = 0.0f;
    float re_value = 0.0f;
    float im_value = 0.0f;

    if (z < 4 && x < 256)
    {
        int src2 = z * (scale + 1) * Wjh + x * Wjh + blockIdx.y;
        re = Sr[z*(scale+1)*Wjv + x*Wjv + blockIdx.x] * hr[src2] - Si[z*(scale+1)*Wjv + x*Wjv + blockIdx.x] * hi[src2];
        im = Sr[z*(scale+1)*Wjv + x*Wjv + blockIdx.x] * hi[src2] + Si[z*(scale+1)*Wjv + x*Wjv + blockIdx.x] * hr[src2];
    }
    cachere[cacheIndex] = re;
    cacheim[cacheIndex] = im;
    __syncthreads();

    int index = threadPerblock / 2;
    while (index != 0)
    {
        if (cacheIndex < index)
        {
            cachere[cacheIndex] += cachere[cacheIndex + index];
            cacheim[cacheIndex] += cacheim[cacheIndex + index];
        }
        index /= 2;
    }
    if (cacheIndex == 0)
    {
        Ic_re[blockId] = cachere[0];
        Ic_im[blockId] = cacheim[0];
        //printf("Ic= %d,blockId= %d\n",Ic_re[blockId],blockId);
    }
}
The kernel launch configuration is:
dim3 dimBlock(count1,256);
dim3 dimGrid(Wjv,Wjh);
lut_kernel<<<dimGrid,dimBlock>>>(d_Sr,d_Si,size,Wjh,Wjv,dvr_table,dvi_table,dhr_table,dhi_table,dIc_re,dIc_im);
If count1>4, what should I do to parallelize the nested for code?

I checked the code briefly and it seems that the computation of the Ic_img and Ic_real elements is easy to parallelize (the count1, scale+1, Wjh and Wjv loops have no dependencies among each other). Thus, there is no need for shared variables and while loops in the kernel; it is easy to implement as below, with an extra parameter int numElements = count1 * (scale+1) * Wjh * Wjv.
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (i < numElements) {
//....
}
The code will be significantly easier to maintain and far less prone to the kinds of bugs that creep into long kernels like your example. If the src2 values never repeat in the innermost loop, the performance is close to optimal as well. If src2 may repeat, use atomicAdd so that the results are correct as expected; with atomicAdd the performance may not be optimal, but at least you have one correctly implemented, bug-free kernel. If it turns out to be a performance bottleneck, modify it afterwards by experimenting with different implementations.
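For illustration, a minimal sketch of such a flattened kernel, following the index layout of the loops in the question; the kernel name, the 64-bit index and the use of atomicAdd are my additions, not code from the question:

// Sketch only: one thread per (oz, ox, xhn, yhn) combination.
// Several (oz, ox) pairs contribute to the same src2, so atomicAdd keeps the sums correct.
__global__ void lut_flat_kernel(const float *Sr, const float *Si,
                                const float *Hr_table, const float *Hi_table,
                                float *Ic_real, float *Ic_img,
                                int scale, int Wjh, int Wjv,
                                long long numElements)
{
    // numElements = count1 * (scale+1) * Wjh * Wjv can exceed INT_MAX when
    // count1 > 4 with the sizes in the question, hence the 64-bit index.
    long long i = (long long)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numElements) {
        // Unflatten i back into the four loop indices (yhn varies fastest).
        int yhn = (int)(i % Wjv);
        int xhn = (int)((i / Wjv) % Wjh);
        int ox  = (int)((i / ((long long)Wjv * Wjh)) % (scale + 1));
        int oz  = (int)( i / ((long long)Wjv * Wjh * (scale + 1)));

        int numx = oz * (scale + 1) * Wjh + ox * Wjh + xhn;
        int src2 = yhn + xhn * Wjv;
        int s    = oz * (scale + 1) * Wjv + ox * Wjv + yhn;

        // Same update as the loop body in the question.
        float re = Sr[s] * Hr_table[numx] - Si[s] * Hi_table[numx];
        float im = Sr[s] * Hi_table[numx] + Si[s] * Hr_table[numx];
        atomicAdd(&Ic_real[src2], re);
        atomicAdd(&Ic_img[src2],  im);
    }
}

A 1D launch such as lut_flat_kernel<<<(unsigned int)((numElements + 255) / 256), 256>>>(...) then covers every (oz, ox, xhn, yhn) combination, with no limit tied to count1.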

Related

Modifying the basic example VECADD to use the shared memory

I wrote the following kernel to use shared memory in the basic CUDA example vecadd (sum of two vectors). The code works, but the elapsed time for the kernel execution is the same as for the basic original code. Can someone suggest a way to easily speed up such a code?
__global__ void vecAdd(float *in1, float *in2, float *out, long int len)
{
    __shared__ float s_in1[THREADS_PER_BLOCK];
    __shared__ float s_in2[THREADS_PER_BLOCK];

    unsigned int xIndex = blockIdx.x * THREADS_PER_BLOCK + threadIdx.x;
    s_in1[threadIdx.x] = in1[xIndex];
    s_in2[threadIdx.x] = in2[xIndex];
    out[xIndex] = s_in1[threadIdx.x] + s_in2[threadIdx.x];
}
Can someone suggest a way to easily speed up such a code?
There are basically no useful optimizations to make on an operation like vector addition. Because of the nature of the calculation, the code could only ever hope to reach 50% peak arithmetic throughput, and the requirement for three memory transactions per FLOP makes this an intrinsically memory bandwidth bound operation.
As a result, this:
__global__ void vecAdd(float *in1, float *in2, float *out, unsigned int len)
{
    unsigned int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    if (xIndex < len) {
        float x = in1[xIndex];
        float y = in2[xIndex];
        out[xIndex] = x + y;
    }
}
is about the best-performing variant on most recent hardware, if the block size is selected for maximum occupancy and len is sufficiently large, for example:
int minGrid, minBlockSize;
cudaOccupancyMaxPotentialBlockSize(&minGrid, &minBlockSize, vecAdd);
int nblocks = (len / minBlockSize) + ((len % minBlockSize > 0) ? 1 : 0);
vecAdd<<<nblocks, minBlockSize>>>(x, y, z, len);

Find the sum reduction issue with size of thread in CUDA

In a previous post here, I asked how to calculate the sum of an array with reduction. Now I have a new problem: with a larger image, my result is not correct, and it changes every time I run.
I tested with a 96*96 image-sized array sample.
First time result: 28169.046875
Second time result: 28169.048828
Expected result: 28169.031250
Here is my code:
#include <stdio.h>
#include <cuda.h>

__global__ void calculate_threshold_kernel(float * input, float * output)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x;
    __shared__ float partialSum[256];
    partialSum[t] = input[idx];
    __syncthreads();

    for (int stride = 1; stride < blockDim.x; stride *= 2)
    {
        if (t % (2 * stride) == 0)
            partialSum[t] += partialSum[t + stride];
        __syncthreads();
    }

    if (t == 0)
    {
        atomicAdd(output, partialSum[0]);
    }
}

int main( void )
{
    float *d_array, *d_output, *h_input, *h_output;
    int img_height = 96;
    int img_width = 96;
    int input_elements = img_height * img_width;

    h_input = (float*) malloc(sizeof(float) * input_elements);
    cudaMalloc((void**)&d_output, sizeof(float));
    cudaMemset(d_output, 0, sizeof(float));
    h_output = (float*)malloc(sizeof(float));
    cudaMalloc((void**)&d_array, input_elements*sizeof(float));

    float array[] = {[array sample]};
    for (int i = 0; i < input_elements; i++)
    {
        h_input[i] = array[i];
    }
    cudaMemcpy(d_array, h_input, input_elements*sizeof(float), cudaMemcpyHostToDevice);

    dim3 blocksize(256);
    dim3 gridsize(input_elements/blocksize.x);
    calculate_threshold_kernel<<<gridsize,blocksize>>>(d_array, d_output);

    cudaMemcpy(h_output, d_output, sizeof(float), cudaMemcpyDeviceToHost);
    printf("Sum from GPU = %f\n", *h_output);
    return 0;
}
While the answer from Kangshiyin is correct about floating point accuracy and floating point arithmetic being non-associative, he is not correct about the reason behind the results differing from one run to the other.
Floating point arithmetic is non-associative, which means that operations performed in a different order can return different results. For example, (((a+b)+c)+d) may be slightly different from ((a+b)+(c+d)) for certain values of a, b, c and d. But both of these results should not vary from run to run.
Your results vary between different runs because atomicAdd causes the additions to be performed in a different order each time. Using double also does not guarantee that the results will be the same between different runs.
There are ways to implement parallel reduction without atomicAdd as the final step (e.g. use a second kernel launch to add the partial sums from the first launch) which can provide consistent (yet slightly different from the CPU) results.
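For illustration, a minimal sketch of that two-pass approach; the kernel name, the 256-thread block size and the zero-padding are my assumptions, not the asker's code:

// First pass: one partial sum per block, written to 'partial'. No atomicAdd,
// so the order of additions is fixed and results are reproducible run to run.
__global__ void block_sum(const float *input, float *partial, int n)
{
    __shared__ float s[256];          // assumes blockDim.x == 256
    int t = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + t;
    s[t] = (idx < n) ? input[idx] : 0.0f;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride)
            s[t] += s[t + stride];
        __syncthreads();
    }
    if (t == 0)
        partial[blockIdx.x] = s[0];
}

// Second pass: reduce the per-block sums with one more launch
// (valid as long as the number of blocks from the first pass is <= 256).
// block_sum<<<nblocks, 256>>>(d_input, d_partial, n);
// block_sum<<<1, 256>>>(d_partial, d_result, nblocks);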
float has limited precision, up to about 7 decimal digits, as explained here:
https://en.wikipedia.org/wiki/Floating_point#Accuracy_problems
The result changes because operations on float are non-associative and you are using a parallel reduction finished with atomicAdd(), which cannot keep the order of the additions fixed.
You could use double instead if you want a more accurate result.

CUDA cudaMemcpy2D not giving expected results [duplicate]

How do I initialize a device array which is allocated using cudaMalloc()?
I tried cudaMemset, but it fails to initialize values other than 0. My code for cudaMemset looks like below, where value is initialized to 5.
cudaMemset(devPtr,value,number_bytes)
As you are discovering, cudaMemset works like the C standard library memset. Quoting from the documentation:
cudaError_t cudaMemset(void *devPtr, int value, size_t count)

Fills the first count bytes of the memory area pointed to by devPtr with the constant byte value value.
So value is a byte value. If you do something like:
int *devPtr;
cudaMalloc((void **)&devPtr,number_bytes);
const int value = 5;
cudaMemset(devPtr,value,number_bytes);
what you are asking for is that each byte of devPtr be set to 5. If devPtr were an array of integers, the result would be that each integer word had the value 84215045 (i.e. 0x05050505). This is probably not what you had in mind.
Using the runtime API, what you could do is write your own generic kernel to do this. It could be as simple as
template<typename T>
__global__ void initKernel(T * devPtr, const T val, const size_t nwords)
{
    int tidx = threadIdx.x + blockDim.x * blockIdx.x;
    int stride = blockDim.x * gridDim.x;

    for (; tidx < nwords; tidx += stride)
        devPtr[tidx] = val;
}
(standard disclaimer: written in browser, never compiled, never tested, use at own risk).
Just instantiate the template for the types you need and call it with a suitable grid and block size, paying attention to the last argument now being a word count, not a byte count as in cudaMemset. This isn't really any different from what cudaMemset does anyway; using that API call results in a kernel launch which is not too different from what I posted above.
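For example, a hypothetical call that fills an int array with the value 5 might look like this (the grid and block sizes are arbitrary, since the grid-stride loop inside the kernel covers any nwords):

int *devPtr;
size_t nwords = 1 << 20;                          // number of ints, not bytes
cudaMalloc((void **)&devPtr, nwords * sizeof(int));
initKernel<int><<<128, 256>>>(devPtr, 5, nwords); // every int becomes 5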
Alternatively, if you can use the driver API, there are cuMemsetD16 and cuMemsetD32, which do the same thing but for half-word and full 32 bit word types. If you need to set 64 bit or larger types (so doubles or vector types), your best option is to use your own kernel.
I also needed a solution to this question, and I didn't really understand the other proposed solution. In particular, I didn't understand why it iterates over the grid with for(; tidx < nwords; tidx += stride), how the kernel should be invoked, and why it uses the counter-intuitive word sizes.
Therefore I created a much simpler monolithic generic kernel and customized it with strides, i.e. you may use it to initialize a matrix in multiple ways, e.g. set rows or columns to any value:
template <typename T>
__global__ void kernelInitializeArray(T* __restrict__ a, const T value,
                                      const size_t n, const size_t incx) {
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    if (tid * incx < n) {
        a[tid * incx] = value;
    }
}
Then you may invoke the kernel like this:
template <typename T>
void deviceInitializeArray(T* a, const T value, const size_t n, const size_t incx) {
    int number_of_blocks = ((n / incx) + BLOCK_SIZE - 1) / BLOCK_SIZE;
    dim3 gridDim(number_of_blocks, 1);
    dim3 blockDim(BLOCK_SIZE, 1);
    kernelInitializeArray<T> <<<gridDim, blockDim>>>(a, value, n, incx);
}
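As a hypothetical usage example (assuming BLOCK_SIZE is defined, say as 256), the incx stride lets you touch every element or only every incx-th element of the flat array:

// d_a: length-n vector, d_m: row-major rows x cols matrix (both already allocated).
deviceInitializeArray<float>(d_a, 0.0f, n, 1);              // set every element to 0.0f
deviceInitializeArray<float>(d_m, 1.0f, rows * cols, cols); // set the first column to 1.0f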

Tricky array arithmetics inside a __global__ kernel (CUDA samples)

I have a question about code from the CUDA sample "CUDA Separable Convolution". In order to do the row convolution, this code first loads data into shared memory. Using pointer arithmetic, each thread moves the input pointers to its own position, and after that copies a piece of global memory into shared memory. Here is the piece of code that confuses me:
__global__ void convolutionRowsKernel(
    float *d_Dst,
    float *d_Src,
    int imageW,
    int imageH,
    int pitch
)
{
    __shared__ float s_Data[ROWS_BLOCKDIM_Y][(ROWS_RESULT_STEPS + 2 * ROWS_HALO_STEPS) * ROWS_BLOCKDIM_X];

    //Offset to the left halo edge
    const int baseX = (blockIdx.x * ROWS_RESULT_STEPS - ROWS_HALO_STEPS) * ROWS_BLOCKDIM_X + threadIdx.x;
    const int baseY = blockIdx.y * ROWS_BLOCKDIM_Y + threadIdx.y;

    d_Src += baseY * pitch + baseX;
    d_Dst += baseY * pitch + baseX;

    //Load main data
    #pragma unroll
    for (int i = ROWS_HALO_STEPS; i < ROWS_HALO_STEPS + ROWS_RESULT_STEPS; i++)
    {
        s_Data[threadIdx.y][threadIdx.x + i * ROWS_BLOCKDIM_X] = d_Src[i * ROWS_BLOCKDIM_X];
    }
    ...
As far as I understand this code, each thread will calculate its own values of baseX and baseY, and after that all active threads will start to increment the pointers d_Src and d_Dst simultaneously.
So, as I understand it, this would be correct if the arrays d_Src and d_Dst were in local memory (i.e. each thread had its own copy of these arrays). But these arrays are in global device memory! So, as I see it, all active threads will increment the pointers and the result will be incorrect. Can someone explain to me why this works?
Thanks
It works because every thread has its own copy of the pointer. The pointer is a kernel argument passed by value, so incrementing it changes only that thread's private copy, not the global memory it points to and not any other thread's copy.
void foo(float* bar){
    bar++;
}

float* test = 0;
foo(test);
cout << test << endl; // will print 0

Can multiple blocks and threads write to the same output?

I have the following CUDA kernel code which computes the sum squared error of two arrays.
__global__ void kSquaredError(double* data, double* recon, double* error,
                              unsigned int num_elements)
{
    const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    for (unsigned int i = idx; i < num_elements; i += blockDim.x * gridDim.x) {
        *error += pow(data[i] - recon[i], 2);
    }
}
I need a single scalar output (error). In this case, it seems like all threads are writing to error simultaneously. Is there some way I need to synchronize it?
Currently I'm getting a bad result so I'm guessing there is some issue.
The implementation you have now is subject to race conditions, because all threads try to update the same global memory address at the same time. You could simply use an atomicAdd instead of *error += pow..., but that suffers from performance issues since every update is serialized.
Instead you should do a reduction using shared memory, as follows:
__global__ void kSquaredError(double* data, double* recon, double* error, unsigned int num_elements) {
    const unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int tid = threadIdx.x;
    // Temporary storage for each thread's error. Shared memory cannot be sized
    // with blockDim.x at compile time, so size it dynamically and pass
    // blockDim.x * sizeof(double) as the third kernel launch parameter.
    extern __shared__ double serror[];

    serror[tid] = 0.0; // initialise before accumulating
    for (unsigned int i = idx; i < num_elements; i += blockDim.x * gridDim.x) {
        serror[tid] += pow(data[i] - recon[i], 2); // each thread's partial sum in shared memory
    }
    __syncthreads();

    int i = blockDim.x >> 1; // halve the threads
    for (; i > 0; i >>= 1) { // reduction in shared memory
        if (tid < i) {
            serror[tid] += serror[tid + i];
        }
        __syncthreads(); // outside the if: all threads must reach the barrier
    }
    if (tid == 0) { // thread 0 updates the value in global memory
        atomicAdd(error, serror[0]); // same as *error += serror[0]; but atomic
                                     // (atomicAdd on double needs compute capability 6.0+)
    }
}
It works by the following principle: each thread has its own slot in shared memory where it accumulates the error over all of its inputs; when it has finished, all threads converge at the __syncthreads instruction to ensure that all data is complete.
Then half of the threads in the block each take one value from the corresponding other half and add it to their own; the number of active threads is halved again, and this repeats until you are left with one thread (thread 0), which holds the total sum for the block.
Thread 0 then updates global memory with an atomicAdd to avoid race conditions with other blocks, if there are any.
If we just used the first example with atomicAdd on every update, there would be one serialized atomic operation per element (num_elements in total); now there are only gridDim.x atomic operations, which is a lot less.
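Since the version above sizes the shared array dynamically (extern __shared__), the launch has to pass the per-block shared-memory size as the third launch parameter. An illustrative launch (device pointer names assumed) would be:

int threads = 256;
int blocks  = 64;   // the grid-stride loop inside the kernel covers any num_elements
kSquaredError<<<blocks, threads, threads * sizeof(double)>>>(d_data, d_recon, d_error, num_elements);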
See Optimizing Parallel Reduction in CUDA for further reading on how reduction in CUDA works.
Edit
Added if statement in the reduction for loop, forgot that.