Why does this CUDA example kernel have a for loop?

I have been looking at the following example from the official CUDA website:
http://docs.nvidia.com/cuda/cuda-samples/index.html#simple-cufft
Download here: http://developer.download.nvidia.com/compute/DevZone/C/Projects/x64/simpleCUFFT.zip
It contains the following kernel:
// Complex pointwise multiplication
static __global__ void ComplexPointwiseMulAndScale(Complex *a, const Complex *b, int size, float scale)
{
    const int numThreads = blockDim.x * gridDim.x;
    const int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = threadID; i < size; i += numThreads)
    {
        a[i] = ComplexScale(ComplexMul(a[i], b[i]), scale);
    }
}
My question is, why is there a for loop here? Doesn't CUDA launch an array of threads simultaneously? I removed the loop, replacing it with the following code, and it produced the same output.
// Complex pointwise multiplication
static __global__ void ComplexPointwiseMulAndScale(Complex *a, const Complex *b, int size, float scale)
{
    const int threadID = blockIdx.x * blockDim.x + threadIdx.x;
    a[threadID] = ComplexScale(ComplexMul(a[threadID], b[threadID]), scale);
}
As this is an official example on the CUDA website, I imagine I must be missing something.

Your version is basically what happens when numThreads is equal to size (but only then).
What the official example does is the following: Suppose numThreads is equal to 4 (for simplicity, usually it will be much larger), and consider the array positions (both for a and b):
a or b:                  x  x  x  x  x  x  x  x
thread that works here:  0  1  2  3  0  1  2  3
Then thread 0 will work on every array position whose index is a multiple of 4 (here positions 0 and 4), thread 1 on positions 1 and 5, et cetera.
The problem with your version is that the caller of your function has to make sure there are at least as many threads as size. For example, if you call your version with a 1-dim grid and both gridDim.x and blockDim.x equal to 2, but on vectors of length 8, then half of your vector isn't processed!
The official example works regardless - no matter how many threads the caller assigns to it, the entire vector will be processed.
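For example, here is a minimal sketch (the device pointers d_a and d_b are hypothetical, already allocated and filled, and the launch numbers are just illustrative) showing that the grid-stride version covers the whole array even when far fewer threads than elements are launched:
// 32 blocks of 256 threads = 8192 threads for one million elements:
// the for loop makes each thread process about 128 elements.
const int size = 1 << 20;
ComplexPointwiseMulAndScale<<<32, 256>>>(d_a, d_b, size, 1.0f / size);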

Related

Find the sum reduction issue with size of thread in CUDA

In a previous post here, I asked how to calculate the sum of an array with reduction. Now I have a new problem: with a larger image my result is not correct, and it changes every time I run.
I tested with a 96*96 image-size array sample.
First time result: 28169.046875
Second time result: 28169.048828
Expected result: 28169.031250
Here is my code:
#include <stdio.h>
#include <cuda.h>
__global__ void calculate_threshold_kernel(float * input, float * output)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x;
    __shared__ float partialSum[256];

    partialSum[t] = input[idx];
    __syncthreads();

    for (int stride = 1; stride < blockDim.x; stride *= 2)
    {
        if (t % (2 * stride) == 0)
            partialSum[t] += partialSum[t + stride];
        __syncthreads();
    }

    if (t == 0)
    {
        atomicAdd(output, partialSum[0]);
    }
}
int main( void )
{
    float *d_array, *d_output, *h_input, *h_output;
    int img_height = 96;
    int img_width = 96;
    int input_elements = img_height * img_width;

    h_input = (float*) malloc(sizeof(float) * input_elements);
    cudaMalloc((void**)&d_output, sizeof(float));
    cudaMemset(d_output, 0, sizeof(float));
    h_output = (float*)malloc(sizeof(float));
    cudaMalloc((void**)&d_array, input_elements*sizeof(float));

    float array[] = {[array sample]};
    for (int i = 0; i < input_elements; i++)
    {
        h_input[i] = array[i];
    }
    cudaMemcpy(d_array, h_input, input_elements*sizeof(float), cudaMemcpyHostToDevice);

    dim3 blocksize(256);
    dim3 gridsize(input_elements/blocksize.x);

    calculate_threshold_kernel<<<gridsize,blocksize>>>(d_array, d_output);
    cudaMemcpy(h_output, d_output, sizeof(float), cudaMemcpyDeviceToHost);
    printf("Sum from GPU = %f\n", *h_output);

    return 0;
}
While the answer from Kangshiyin is correct about floating point accuracy and floating point addition being non-associative, he is not correct about the reason behind the results differing from one run to the other.
Floating point addition is non-associative, which means operations performed in a different order can return different results. For example, (((a+b)+c)+d) may be slightly different from ((a+b)+(c+d)) for certain values of a, b, c and d. But neither of these results should vary from run to run.
Your results vary between runs because atomicAdd causes the order of the additions to differ from run to run. Using double also does not guarantee that the results will be the same between different runs.
There are ways to implement parallel reduction without atomicAdd as the final step (ex: use a second kernel launch to add partial sums from the first launch) which can provide consistent (yet slightly different from CPU) results.
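A minimal sketch of that idea (untested, with made-up kernel names, assuming a 256-thread block as in your code): the first kernel writes one partial sum per block, and a second single-threaded kernel adds those partials in a fixed order, so repeated runs give bit-identical results.
// First launch: one partial sum per block, written to partial[blockIdx.x].
__global__ void block_partial_sums(const float *input, float *partial, int n)
{
    __shared__ float s[256];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x;
    s[t] = (idx < n) ? input[idx] : 0.0f;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (t < stride)
            s[t] += s[t + stride];
        __syncthreads();
    }
    if (t == 0)
        partial[blockIdx.x] = s[0];
}
// Second launch with <<<1,1>>>: add the partials sequentially, so the order
// of additions never changes between runs.
__global__ void finalize_sum(const float *partial, float *output, int numBlocks)
{
    float sum = 0.0f;
    for (int i = 0; i < numBlocks; i++)
        sum += partial[i];
    *output = sum;
}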
float has limited precision, up to about 7 decimal digits, as explained here:
https://en.wikipedia.org/wiki/Floating_point#Accuracy_problems
The result changes because floating point addition is non-associative and you are using atomicAdd(), which cannot keep the order of additions fixed in a parallel reduction.
You could use double instead if you want a more accurate result.

CUDA cudaMemcpy2D not giving expected results [duplicate]

How do I initialize a device array which is allocated using cudaMalloc()?
I tried cudaMemset, but it fails to initialize the values to anything other than 0. My code for cudaMemset looks like below, where value is initialized to 5.
cudaMemset(devPtr,value,number_bytes)
As you are discovering, cudaMemset works like the C standard library memset. Quoting from the documentation:
cudaError_t cudaMemset ( void * devPtr,
int value,
size_t count
)
Fills the first count bytes of the memory area pointed to by devPtr
with the constant byte value value.
So value is a byte value. If you do something like:
int *devPtr;
cudaMalloc((void **)&devPtr,number_bytes);
const int value = 5;
cudaMemset(devPtr,value,number_bytes);
what you are asking for is that each byte of devPtr be set to 5. If devPtr were an array of integers, the result would be that each integer word has the value 84215045 (0x05050505). This is probably not what you had in mind.
Using the runtime API, what you could do is write your own generic kernel to do this. It could be as simple as
template<typename T>
__global__ void initKernel(T * devPtr, const T val, const size_t nwords)
{
    int tidx = threadIdx.x + blockDim.x * blockIdx.x;
    int stride = blockDim.x * gridDim.x;

    for (; tidx < nwords; tidx += stride)
        devPtr[tidx] = val;
}
(standard disclaimer: written in browser, never compiled, never tested, use at own risk).
Just instantiate the template for the types you need and call it with a suitable grid and block size, paying attention to the last argument now being a word count, not a byte count as in cudaMemset. This isn't really any different from what cudaMemset does anyway; using that API call results in a kernel launch which is not too different from what I posted above.
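For instance, a sketch of instantiating it for float (devPtr and nwords are placeholder names, and the grid and block sizes are just reasonable defaults; the grid-stride loop copes with any launch configuration):
const size_t nwords = 1 << 20;
float *devPtr;
cudaMalloc((void **)&devPtr, nwords * sizeof(float));
initKernel<float><<<256, 256>>>(devPtr, 5.0f, nwords); // set every word to 5.0f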
Alternatively, if you can use the driver API, there are cuMemsetD16 and cuMemsetD32, which do the same thing but for 16 bit and 32 bit word types. If you need to set 64 bit or larger types (so doubles or vector types), your best option is to use your own kernel.
I also needed a solution to this question and I didn't really understand the other proposed solution. In particular, I didn't understand why it iterates over the grid (for(; tidx < nwords; tidx += stride)), nor, for that matter, the kernel invocation and the counter-intuitive use of word counts rather than byte counts.
Therefore I created a much simpler monolithic generic kernel and customized it with strides, i.e. you may use it to initialize a matrix in multiple ways, e.g. set rows or columns to any value:
template <typename T>
__global__ void kernelInitializeArray(T* __restrict__ a, const T value,
                                      const size_t n, const size_t incx) {
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    if (tid * incx < n) {
        a[tid * incx] = value;
    }
}
Then you may invoke the kernel like this:
template <typename T>
void deviceInitializeArray(T* a, const T value, const size_t n, const size_t incx) {
    int number_of_blocks = ((n / incx) + BLOCK_SIZE - 1) / BLOCK_SIZE;
    dim3 gridDim(number_of_blocks, 1);
    dim3 blockDim(BLOCK_SIZE, 1);
    kernelInitializeArray<T> <<<gridDim, blockDim>>>(a, value, n, incx);
}
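For example, a sketch of using it (this assumes BLOCK_SIZE is defined, say 256, and a column-major m x n matrix d_A already allocated on the device; the names are placeholders):
// fill the whole matrix with ones (contiguous elements, stride 1)
deviceInitializeArray<float>(d_A, 1.0f, m * n, 1);
// set row 0 of the column-major matrix to zero: elements 0, m, 2m, ...
deviceInitializeArray<float>(d_A, 0.0f, m * n, m);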

Tricky array arithmetics inside a __global__ kernel (CUDA samples)

I have a question about code from the CUDA sample "CUDA Separable Convolution". To perform the row convolution, this code first loads data into shared memory. Using pointer arithmetic, each thread moves the input pointers to its own position, and after that copies a piece of global memory into shared memory. Here is the piece of code that confuses me:
__global__ void convolutionRowsKernel(
    float *d_Dst,
    float *d_Src,
    int imageW,
    int imageH,
    int pitch
)
{
    __shared__ float s_Data[ROWS_BLOCKDIM_Y][(ROWS_RESULT_STEPS + 2 * ROWS_HALO_STEPS) * ROWS_BLOCKDIM_X];

    //Offset to the left halo edge
    const int baseX = (blockIdx.x * ROWS_RESULT_STEPS - ROWS_HALO_STEPS) * ROWS_BLOCKDIM_X + threadIdx.x;
    const int baseY = blockIdx.y * ROWS_BLOCKDIM_Y + threadIdx.y;

    d_Src += baseY * pitch + baseX;
    d_Dst += baseY * pitch + baseX;

    //Load main data
#pragma unroll

    for (int i = ROWS_HALO_STEPS; i < ROWS_HALO_STEPS + ROWS_RESULT_STEPS; i++)
    {
        s_Data[threadIdx.y][threadIdx.x + i * ROWS_BLOCKDIM_X] = d_Src[i * ROWS_BLOCKDIM_X];
    }
    ...
As far as I understand this code, each thread calculates its own values of baseX and baseY, and after that all active threads start to increment the pointers d_Src and d_Dst simultaneously.
So, as far as I know, this would be correct if the arrays d_Src and d_Dst were in local memory (i.e. each thread had its own copy of these arrays). But these arrays are in global device memory! So all active threads will increment the pointers, and the result will be incorrect. Can someone explain to me why this works?
Thanks
It works because every thread has its own copy of the pointer: function (and kernel) arguments are passed by value, so incrementing d_Src inside one thread only changes that thread's private copy, not the global memory it points to.
#include <iostream>

void foo(float* bar){
    bar++;                          // modifies only foo's local copy of the pointer
}

int main(){
    float* test = 0;
    foo(test);
    std::cout << test << std::endl; // will print 0
}
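The same applies inside a kernel (a sketch with a made-up kernel name): each thread receives its own copy of the pointer argument, so the += below only moves that thread's local pointer, while the global memory it points into stays shared and untouched.
__global__ void offsetExample(float *d_Src)
{
    d_Src += threadIdx.x;   // private, per-thread pointer arithmetic
    float v = *d_Src;       // each thread reads a different global element
    (void)v;                // silence unused-variable warnings
}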

Cuda Kernel with reduction - logic errors for dot product of 2 matrices

I am just starting off with CUDA and am trying to wrap my brain around the CUDA reduction algorithm. In my case, I have been trying to get the dot product of two matrices. But I am getting the right answer only for matrices of size 2. For any other size I get it wrong.
This is only a test, so I am keeping the matrix size very small. Only about 100 elements, so a single block fits it all.
Any help would be greatly appreciated. Thanks!
Here is the regular code
float* ha = new float[n]; // matrix a
float* hb = new float[n]; // matrix b
float* hc = new float[1]; // sum of a.b
float dx = hc[0];
float hx = 0;
// dot product
for (int i = 0; i < n; i++)
    hx += ha[i] * hb[i];
Here is my cuda kernel
__global__ void sum_reduce(float* da, float* db, float* dc, int n)
{
    int tid = threadIdx.x;
    dc[tid] = 0;
    for (int stride = 1; stride < n; stride *= 2) {
        if (tid % (2 * stride) == 0)
            dc[tid] += (da[tid] * db[tid]) + (da[tid+stride] * db[tid+stride]);
        __syncthreads();
    }
}
My complete code : http://pastebin.com/zS85URX5
Hopefully you can figure out why it works for the n=2 case, so let's skip that and take a look at why it fails for some other case; let's choose n=4. When n = 4, you have 4 threads, numbered 0 to 3.
In the first iteration of your for-loop, stride = 1, so the threads that pass the if test are threads 0 and 2.
thread 0: dc[0] += da[0]*db[0] + da[1]*db[1];
thread 2: dc[2] += da[2]*db[2] + da[3]*db[3];
So far so good. In the second iteration of your for loop, stride is 2, so the thread that passes the if test is thread 0 (only).
thread 0: dc[0] += da[0]*db[0] + da[2]*db[2];
But this doesn't make sense and is not what we want at all. What we want is something like:
dc[0] += dc[2];
So it's broken. I spent a little while trying to think about how to fix this in just a few steps, but it just doesn't make sense to me as a reduction. If you replace your kernel code with this code, I think you'll have good results. It's not a lot like your code, but it was the closest I could come to something that would work for all the cases you've envisioned (ie. n < max thread block size, using a single block):
// CUDA kernel code
__global__ void sum_reduce(float* da, float* db, float* dc, int n)
{
    int tid = threadIdx.x;
    // do multiplication in parallel for full width of threads
    dc[tid] = da[tid] * db[tid];
    // wait for all threads to complete multiply step
    __syncthreads();
    int stride = blockDim.x;
    while (stride > 1) {
        // handle odd step
        if ((stride & 1) && (tid == 0)) dc[0] += dc[stride - 1];
        // successively divide problem by 2
        stride >>= 1;
        // add each upper half element to each lower half element
        if (tid < stride) dc[tid] += dc[tid + stride];
        // wait for all threads to complete add step
        __syncthreads();
    }
}
Note that I'm not really using the n parameter. Since you are launching the kernel with n threads, the blockDim.x built-in variable is equal to n in this case.
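For completeness, a minimal sketch of the single-block launch this assumes (da, db and dc are device arrays of n floats, with da and db already filled; the final sum ends up in dc[0]):
sum_reduce<<<1, n>>>(da, db, dc, n);
float result;
cudaMemcpy(&result, dc, sizeof(float), cudaMemcpyDeviceToHost);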

CUDA 2D Convolution kernel

I'm a beginner in CUDA and I'm trying to implement a Sobel Edge detection kernel.
I'm using this code for it, but it doesn't work.
Can anyone tell me what is wrong with it? I just get some -1s and some really big values.
__global__ void EdgeDetect_Hor(int *gpu_Edge_Hor, int *gpu_P,
                               int *gpu_Hor, int W, int H)
{
    int X = threadIdx.x;
    int Y = threadIdx.y;
    int sum = 0;
    int k1, k2;
    int min1, min2;

    for (k1 = 0; k1 < 3; k1++)
        for (k2 = 0; k2 < 3; k2++)
            sum += gpu_Hor[k1*3+k2] * gpu_P[(X-k1)*H+Y-k2];

    gpu_Edge_Hor[X*H+Y] = sum/5000;
}
I call this kernel like this:
dim3 dimBlock(W,H);
dim3 dimGrid(1,1);
EdgeDetect_Hor<<<dimGrid, dimBlock>>>(gpu_Edge_Hor, gpu_P, gpu_Hor, W, H);
First, your problem is that you process an image of 480x720 pixels. CUDA supports a maximum thread block size of 1024 for compute capability 2.0 and greater, and 512 for earlier devices, so you can't execute that many threads in one block. The line dim3 dimBlock(W,H); is incorrect. You should divide your threads into several blocks.
Another problem is that CUDA processes data in row-major order, so you should change your memory access pattern.
The right memory access pattern for 2D arrays in CUDA is
BaseAddress + width * Y + X
where
unsigned int X = blockIdx.x * blockDim.x + threadIdx.x;
unsigned int Y = blockIdx.y * blockDim.y + threadIdx.y;
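Putting both points together, a sketch of a corrected launch (the 16x16 tile size is just a common choice; the kernel itself would also need the row-major indexing above and bounds checks at the image edges):
dim3 dimBlock(16, 16);
dim3 dimGrid((W + dimBlock.x - 1) / dimBlock.x,
             (H + dimBlock.y - 1) / dimBlock.y);
EdgeDetect_Hor<<<dimGrid, dimBlock>>>(gpu_Edge_Hor, gpu_P, gpu_Hor, W, H);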