CUDA kernel only works with a 1D thread index

There is a weird problem. I have the following code. When I call the first function it does not give the correct result, but when I call function2 (the second function) it works fine. This is very strange to me. Does anyone have any idea what the problem is? Thanks!
// Class<double> stands in for the question's templated class type
// (the original "class<double>" is not valid C++).
__global__ void function(int w, Class<double> C, float *result) {
    int r = threadIdx.x + blockIdx.x * blockDim.x;
    int c = threadIdx.y + blockIdx.y * blockDim.y;
    int half_w = w / 2;
    if (r < w && c < w) {
        double dis = sqrt((double)(r - half_w) * (r - half_w) + (double)(c - half_w) * (c - half_w));
        result[c * w + r] = (float)C.getVal(dis);
    }
}
__global__ void function2(int w, Class<double> C, float *result) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int half_w = w / 2;
    int r = tid / w;
    int c = tid % w;
    if (r < w && c < w) {
        double dis = sqrt((double)(r - half_w) * (r - half_w) + (double)(c - half_w) * (c - half_w));
        result[c * w + r] = (float)C.getVal(dis);
    }
}
UPDATE:
I use function and function2 to draw an image. Each pixel value is based on the distance between the image center and the current pixel position; given that distance, the getVal method of class C calculates the value for the pixel. So, in the kernel, I just make every thread calculate the distance and the corresponding pixel value. The correct result is compared against a CPU version. function just gives some random values, some very large and some very small. When I changed result[c * w + r] = (float)C.getVal(dis) to result[c * w + r] = 1.0f, the generated image did not seem to change.
The image size is W x W. To launch function I set
dim3 grid_dim(W / 64 + 1, W / 64 + 1);
dim3 block_dim(64, 64);
function<<<grid_dim, block_dim>>>(W, C, cu_img);
To launch function2
function2<<<W / 128 + 1, 128>>>(W, C, cu_img);
Fixed:
I solved the problem: I had assigned too many threads to one block. The maximum number of threads per block is 1024 on my device, and block_dim(64, 64) requests 4096. In fact, when I ran cuda-memcheck, I could see that the failing kernel was never even launched.
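For reference, a minimal sketch of a launch configuration that stays under the 1024-threads-per-block limit, plus the error check that would have surfaced the failed launch (the 16x16 block size and the variable names are my choices, not from the original post):

// A 16x16 block (256 threads) is safely below the 1024-thread-per-block limit.
dim3 block_dim(16, 16);
dim3 grid_dim((W + block_dim.x - 1) / block_dim.x,
              (W + block_dim.y - 1) / block_dim.y);
function<<<grid_dim, block_dim>>>(W, C, cu_img);

// Checking the launch status reports "invalid configuration argument"
// as soon as a block requests more threads than the device supports.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));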

Related

NVIDIA CUDA YUV (NV12) to RGB conversion algorithm breakdown

I am trying to modify the original YUV->RGB kernel provided in the sample code of the NVIDIA Video SDK, and I need help understanding some of its parts.
Here is the kernel code:
template<class YuvUnitx2, class Rgb, class RgbIntx2>
__global__ static void YuvToRgbKernel(uint8_t* pYuv, int nYuvPitch, uint8_t* pRgb, int nRgbPitch, int nWidth, int nHeight) {
    int x = (threadIdx.x + blockIdx.x * blockDim.x) * 2;
    int y = (threadIdx.y + blockIdx.y * blockDim.y) * 2;
    if (x + 1 >= nWidth || y + 1 >= nHeight) {
        return;
    }
    uint8_t* pSrc = pYuv + x * sizeof(YuvUnitx2) / 2 + y * nYuvPitch;
    uint8_t* pDst = pRgb + x * sizeof(Rgb) + y * nRgbPitch;
    YuvUnitx2 l0 = *(YuvUnitx2*)pSrc;
    YuvUnitx2 l1 = *(YuvUnitx2*)(pSrc + nYuvPitch);
    YuvUnitx2 ch = *(YuvUnitx2*)(pSrc + (nHeight - y / 2) * nYuvPitch);
    // YuvToRgbForPixel - returns rgba encoded in uint32_t (.d)
    *(RgbIntx2*)pDst = RgbIntx2{
        YuvToRgbForPixel<Rgb>(l0.x, ch.x, ch.y).d,
        YuvToRgbForPixel<Rgb>(l0.y, ch.x, ch.y).d,
    };
    *(RgbIntx2*)(pDst + nRgbPitch) = RgbIntx2{
        YuvToRgbForPixel<Rgb>(l1.x, ch.x, ch.y).d,
        YuvToRgbForPixel<Rgb>(l1.y, ch.x, ch.y).d,
    };
}
Here are my basic assumptions; some of them are possibly wrong:
NV12 has two planes: one for luma and one for interleaved chroma.
The kernel tries to write 4 pixels at a time.
If assumption 2 is correct, the question is why the same chroma (ch) values are used for all 4 pixels. And if I am wrong on 2, please explain what exactly happens here.
The chroma planes in NV12 and NV21 are subsampled by a factor of 2 in each dimension.
For every 2x2 macro pixel in the output there are 4 luma (Y) samples but only 1 Cb and 1 Cr element.
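A minimal sketch of the addressing this implies, in plain CUDA C (the variable names are mine, not from the SDK; the chroma plane is assumed to have the same pitch as the luma plane, which is the NV12 convention):

// NV12: a full-resolution Y plane followed by one half-height plane
// of interleaved Cb/Cr pairs, both with pitch nYuvPitch.
uint8_t  y_val = pYuv[y * nYuvPitch + x];
uint8_t* pUV   = pYuv + nHeight * nYuvPitch;                 // start of UV plane
uint8_t  cb    = pUV[(y / 2) * nYuvPitch + (x / 2) * 2];     // Cb for the 2x2 block
uint8_t  cr    = pUV[(y / 2) * nYuvPitch + (x / 2) * 2 + 1]; // Cr for the 2x2 block
// The four pixels (x,y), (x+1,y), (x,y+1), (x+1,y+1) all map to the same
// (cb, cr) pair, which is why the kernel loads ch once and reuses it.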

How to evaluate memory time and compute time for CUDA kernel?

I was working on an algorithm in CUDA and wanted to understand the performance of my kernel so I could optimize it appropriately.
I am required to determine whether my kernel is compute bound or memory bound using source code modifications only. The NVIDIA docs suggest running the kernel without memory accesses to determine the compute time, and similarly running it without any computations to determine the memory time.
I do not know how to modify my source code appropriately to achieve this. How can you perform computations without memory access (that is, how can you compute a result without reading the variables stored in memory)? Could you suggest an example for the memory case and the computation case in the following code, so I can work on the full modification myself...
__device__ inline float cndGPU(float d)
{
    const float A1 = 0.31938153f;
    const float A2 = -0.356563782f;
    const float A3 = 1.781477937f;
    const float A4 = -1.821255978f;
    const float A5 = 1.330274429f;
    const float RSQRT2PI = 0.39894228040143267793994605993438f;

    float K = 1.0f / (1.0f + 0.2316419f * fabsf(d));

    float cnd = RSQRT2PI * __expf(-0.5f * d * d) *
                (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5)))));

    if (d > 0)
        cnd = 1.0f - cnd;

    return cnd;
}

__device__ inline void BlackScholesBodyGPU(
    float &CallResult,
    float &PutResult,
    float S, // Stock price
    float X, // Option strike
    float T, // Option years
    float R, // Riskless rate
    float V  // Volatility rate
)
{
    float sqrtT, expRT;
    float d1, d2, CNDD1, CNDD2;

    sqrtT = sqrtf(T);
    d1 = (__logf(S / X) + (R + 0.5f * V * V) * T) / (V * sqrtT);
    d2 = d1 - V * sqrtT;

    CNDD1 = cndGPU(d1);
    CNDD2 = cndGPU(d2);

    // Calculate Call and Put simultaneously
    expRT = __expf(-R * T);
    CallResult = S * CNDD1 - X * expRT * CNDD2;
    PutResult = X * expRT * (1.0f - CNDD2) - S * (1.0f - CNDD1);
}
How I see it.
If you have:
__device__ float cndGPU(float d) {
    const float a = 1;
    const float b = 2;
    float c;
    c = a + b + arr[(int)d]; // arr is some array in device memory
    return c;
}
To check the compute time without memory access, literally fold all your computing expressions into one, without using variables:
return 1 + 2 + 3; // just put some number that could be in arr[d]
To check the memory access time, do literally the opposite:
const float a = 1;
const float b = 2;
float c;
c = arr[(int)d]; // here we have our memory access
return c;
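Applied to the BlackScholes code above, a sketch of the two variants might look like this (the constant inputs, kernel names, and the guarded dummy write are my own scaffolding, not from the SDK sample; the compute-only version must still write something, or the compiler removes the math entirely):

// Compute-only variant: inputs come from the thread index instead of
// memory, and one guarded write keeps the arithmetic alive.
__global__ void BlackScholesComputeOnly(float *dummy)
{
    float s = 1.0f + threadIdx.x; // stand-in for a loaded stock price
    float call, put;
    BlackScholesBodyGPU(call, put, s, 1.0f, 1.0f, 0.02f, 0.30f);
    if (call == -1.0f)            // never true for these inputs,
        *dummy = put;             // but the compiler must keep the math
}

// Memory-only variant: the same loads and stores as the real kernel,
// with the computation replaced by trivial arithmetic.
__global__ void BlackScholesMemoryOnly(float *CallResult, float *PutResult,
                                       const float *S, const float *X,
                                       const float *T, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float s = S[i], x = X[i], t = T[i];
        CallResult[i] = s + x + t; // same memory traffic, no real compute
        PutResult[i]  = s - x - t;
    }
}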

Count the number of cycles in a CUDA kernel

How can I count the number of cycles performed by a function like the following? Should I simply count the number of additions, multiplications, and divisions? Where can I check how many cycles an addition takes in CUDA?
__global__
void mandelbrotSet_per_element(Grayscale *image){
    float minR = -2.0f, maxR = 1.0f;
    float minI = -1.2f, maxI = minI + (maxR - minR) * c_rows / c_cols;
    float realFactor = (maxR - minR) / (c_cols - 1);
    float imagFactor = (maxI - minI) / (c_rows - 1);

    bool isInSet;
    float c_real, c_imag, z_real, z_imag;

    int y = blockDim.y * blockIdx.y + threadIdx.y;
    int x = blockDim.x * blockIdx.x + threadIdx.x;

    while (y < c_rows){
        while (x < c_cols) {
            c_real = minR + x * realFactor;
            c_imag = maxI - y * imagFactor;
            z_real = c_real; z_imag = c_imag;
            isInSet = true;
            for (int k = 0; k < c_iterations; k++){
                float z_real2 = z_real * z_real;
                float z_imag2 = z_imag * z_imag;
                if (z_real2 + z_imag2 > 4){
                    isInSet = false;
                    break;
                }
                z_imag = 2 * z_real * z_imag + c_imag;
                z_real = z_real2 - z_imag2 + c_real;
            }
            if (isInSet) image[y * c_cols + x] = 255;
            else         image[y * c_cols + x] = 0;
            x += blockDim.x * gridDim.x;
        }
        x = blockDim.x * blockIdx.x + threadIdx.x;
        y += blockDim.y * gridDim.y;
    }
}
Instruction throughput is described in the programming guide here
You can also try measuring a sequence of instructions using the native clock() function described here
The compiler tends to obscure the actual count of operations at the source code level (increasing or possibly decreasing the apparent arithmetic intensity), so if you want to identify exactly what the machine is doing you may want to inspect the PTX (nvcc -ptx ...) or the machine-level assembly, called SASS, which you can extract from an executable using the cuobjdump utility.
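A minimal sketch of the clock()-based measurement (the kernel name, loop body, and iteration count are my own example; divide the result by the iteration count to estimate cycles per add, and treat the numbers as relative rather than absolute):

__global__ void timeAdds(long long *cycles, float *sink)
{
    float v = threadIdx.x;
    long long start = clock64();
    // Time a fixed sequence of dependent single-precision adds.
    for (int k = 0; k < 1000; k++)
        v = v + 1.0f;
    long long stop = clock64();
    *sink   = v;            // keeps the loop from being optimized away
    *cycles = stop - start; // elapsed SM clock cycles seen by this thread
}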

Matrix multiplication using CUDA -- wrong results

I have the following kernel code for matrix multiplication. Matrix A is 1x3, matrix B is 3x3, and the resulting matrix C is 1x3. In the following code the Width would be 3.
__global__ void MatrixMulKernel(float* d_M, float* d_N, float* d_P, int Width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= Width || col >= Width) { // matrix range
        return;
    }
    float P_val = 0.0f;
    for (int k = 0; k < Width; ++k) {
        float M_elem = d_M[row * Width + k];
        float N_elem = d_N[k * Width + col];
        P_val += M_elem * N_elem;
    }
    d_P[row * Width + col] = P_val;
}
My kernel code is called as follows:
int block_size = 32;
dim3 dimGrid(Width/block_size, Width/block_size);
dim3 dimBlock(block_size, block_size);
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, 3);
But I am getting wrong results: the results are always zero.
Can anyone help me, please?
The code looks like it is for the multiplication of two square matrices of the same size.
Width is the number of columns of the first matrix.
You have to provide this as an argument to the function.
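Beyond that, note what the posted launch configuration does when Width = 3 and block_size = 32: the integer division Width/block_size evaluates to 0, so the grid is empty, the kernel never runs, and the output stays at whatever the buffer held (zero). A minimal sketch of a rounded-up launch, using standard ceiling division (not from the original post):

int block_size = 32;
// Round up so a Width smaller than block_size still gets one block.
dim3 dimGrid((Width + block_size - 1) / block_size,
             (Width + block_size - 1) / block_size);
dim3 dimBlock(block_size, block_size);
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);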

CUDA kernel - nested for loop

Hello
I'm trying to write a CUDA kernel to perform the following piece of code.
for (n = 0; n < (total - 1); n++)
{
    a = values[n];
    for (i = n + 1; i < total; i++)
    {
        b = values[i] - a;
        c = b * b;
        if (c < 10)
            newvalues[i] = c;
    }
}
This is what I have currently, but it does not seem to give the correct results. Does anyone know what I'm doing wrong? Cheers
__global__ void calc(int total, float *values, float *newvalues){
    float a, b, c;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    for (int n = idx; n < (total - 1); n += blockDim.x * gridDim.x){
        a = values[n];
        for (int i = n + 1; i < total; i++){
            b = values[i] - a;
            c = b * b;
            if (c < 10)
                newvalues[i] = c;
        }
    }
}
Recast this problem in 2D and launch your kernel with 2D thread blocks. The total number of threads in each of the x and y dimensions will be equal to total. The kernel code should look like this:
__global__ void calc(float *values, float *newvalues, int total){
    float a, b, c;
    int n = blockIdx.y * blockDim.y + threadIdx.y;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= total || i >= total)
        return;
    a = values[n];
    b = values[i] - a;
    c = b * b;
    if (c < 10)
        newvalues[i] = c;
    // I don't know your problem statement but I think it should be like:
    // newvalues[n * total + i] = c;
}
Update:
This is how you should call the kernel:
dim3 block(16, 16);
dim3 grid((total + 15) / 16, (total + 15) / 16);
calc<<<grid, block>>>(values, newvalues, total);
Also make sure you add this check to the kernel (see the updated kernel):
if (n >= total || i >= total)
    return;
Update 2:
Fixed the blockIdy.y typo; the correct form is blockIdx.y.
I'm probably way off, but the n < (total-1) check in
for (int n = idx; n < (total-1); n += blockDim.x*gridDim.x)
seems different from the original version.
Why don't you just remove the outer loop and launch the kernel with as many threads as you need for it? It's a bit odd to have a loop that depends on your block index; normally you try to avoid such loops.
Secondly, it seems to me that newvalues[i] can be overwritten by different threads.
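To make that race concrete: every pair (n, i) with the same i writes to the same newvalues[i]. A sketch of a race-free variant using the flattened total x total output that the first answer's comment hints at (the output size is my assumption about the intended problem):

__global__ void calc2D(const float *values, float *newvalues, int total){
    int n = blockIdx.y * blockDim.y + threadIdx.y;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= total - 1 || i <= n || i >= total)
        return;                          // keep only the pairs with i > n
    float b = values[i] - values[n];
    float c = b * b;
    if (c < 10)
        newvalues[n * total + i] = c;    // unique slot per (n, i): no race
}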