For loop in a return statement of a function

I'm a beginner in C++ and I have a problem that I don't know how to solve.
I have an int function that takes a few parameters and returns a value:
int sphere(const float &X, const float &Y, const float &Z,
           const int &Px, const int &Py, const int &Pz,
           const int &diameterOfSphere, const int &number)
{
    return pow(Px - X, 2) + pow(Py + (diameterOfSphere * (number - 1)) - Y, 2)
         + pow(Pz - Z, 2) <= pow(diameterOfSphere / 2, 2);
}
In this function, the integer "number" may start from 2 and go up to, for example, 100. I need something so that if I choose 100 for "number", the return expression is repeated 99 times, with the terms separated by a plus (+).
I could do it manually, but that would require writing a lot of code, which is not reasonable.
For example, here I wrote it out manually for just a few terms:
return (pow(Px - X, 2) + pow((Py + (diameterOfSphere * 2)) - Y, 2) + pow(Pz - Z, 2)
        <= pow(diameterOfSphere / 2, 2))
     + (pow(Px - X, 2) + pow((Py + (diameterOfSphere * 3)) - Y, 2) + pow(Pz - Z, 2)
        <= pow(diameterOfSphere / 2, 2))
     + (pow(Px - X, 2) + pow((Py + (diameterOfSphere * 4)) - Y, 2) + pow(Pz - Z, 2)
        <= pow(diameterOfSphere / 2, 2))
     + (pow(Px - X, 2) + pow((Py + (diameterOfSphere * 5)) - Y, 2) + pow(Pz - Z, 2)
        <= pow(diameterOfSphere / 2, 2));
Is there an easier way? I know I have to use a loop, but I don't know how to do it in this case.
Thanks a lot

Don't use pow() to square terms; pow() is a general exponentiation function and is quite slow. Break your formula up and format your lines to make the code readable. Your point's coordinates are integers, is that intentional? This variant is not only more readable, it is also more likely to be optimized by the compiler:
int sphere(const float &X, const float &Y, const float &Z,
           const int &Px, const int &Py, const int &Pz,
           const int &diameterOfSphere, const int &number)
{
    const float dx = Px - X;
    const float dy = Py + diameterOfSphere * (number - 1) - Y;
    const float dz = Pz - Z;
    const float D  = dx*dx + dy*dy + dz*dz;
    return D <= 0.25 * diameterOfSphere * diameterOfSphere;
}
Now, if I understood you right, you need recursion, or a loop that emulates recursion. You can actually call a function from itself, did you know that?
int sphere(const float &X, const float &Y, const float &Z,
           const int &Px, const int &Py, const int &Pz,
           const int &diameterOfSphere, const int &number)
{
    const float dx = Px - X;
    const float dy = Py + diameterOfSphere * (number - 1) - Y;
    const float dz = Pz - Z;
    const float D  = dx*dx + dy*dy + dz*dz;
    if (!(number > 0))
        return 0;
    return D <= 0.25 * diameterOfSphere * diameterOfSphere
              + sphere(X, Y, Z, Px, Py, Pz, diameterOfSphere, number - 1);
}
Negative sides of recursion: a) each function call fills the stack with the stored variables and parameters, b) there is an extra call that returns immediately.
The expression Py + diameterOfSphere * (number - 1) - Y throws me off, is that a mistake? It would almost never cause the comparison to be true, and it's still not clear what you're trying to do with those comparisons. So, while I modified the code so it matches your idea, it looks chaotic/senseless. A >= or <= comparison yields 1 or 0 as its result. Or did you mean this?
return (D <= 0.25 * diameterOfSphere * diameterOfSphere)
     + sphere(X, Y, Z, Px, Py, Pz, diameterOfSphere, number - 1);
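If recursion feels awkward, a plain loop does the same summation. Here is a minimal sketch under assumptions: the function name spheres and the parameter maxNumber are invented here, and it sums the parenthesized comparison (1 when the point lies inside that sphere, 0 otherwise) for number = 2 up to maxNumber:
int spheres(const float &X, const float &Y, const float &Z,
            const int &Px, const int &Py, const int &Pz,
            const int &diameterOfSphere, const int &maxNumber)
{
    int count = 0;
    for (int number = 2; number <= maxNumber; ++number)
    {
        const float dx = Px - X;
        const float dy = Py + diameterOfSphere * (number - 1) - Y;
        const float dz = Pz - Z;
        const float D  = dx*dx + dy*dy + dz*dz;
        // each term contributes 1 when the point is inside that sphere
        count += (D <= 0.25 * diameterOfSphere * diameterOfSphere);
    }
    return count;
}
Compared with the recursive version, the loop avoids the extra stack frames and makes the number of terms explicit.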

Related

NVIDIA CUDA YUV (NV12) to RGB conversion algorithm breakdown

I am trying to modify the original YUV->RGB kernel provided in the sample code of the NVIDIA Video SDK, and I need help understanding some of its parts.
Here is the kernel code:
template<class YuvUnitx2, class Rgb, class RgbIntx2>
__global__ static void YuvToRgbKernel(uint8_t* pYuv, int nYuvPitch, uint8_t* pRgb, int nRgbPitch, int nWidth, int nHeight) {
    int x = (threadIdx.x + blockIdx.x * blockDim.x) * 2;
    int y = (threadIdx.y + blockIdx.y * blockDim.y) * 2;
    if (x + 1 >= nWidth || y + 1 >= nHeight) {
        return;
    }

    uint8_t* pSrc = pYuv + x * sizeof(YuvUnitx2) / 2 + y * nYuvPitch;
    uint8_t* pDst = pRgb + x * sizeof(Rgb) + y * nRgbPitch;

    YuvUnitx2 l0 = *(YuvUnitx2*)pSrc;
    YuvUnitx2 l1 = *(YuvUnitx2*)(pSrc + nYuvPitch);
    YuvUnitx2 ch = *(YuvUnitx2*)(pSrc + (nHeight - y / 2) * nYuvPitch);

    //YuvToRgbForPixel - returns rgba encoded in uint32_t (.d)
    *(RgbIntx2*)pDst = RgbIntx2{
        YuvToRgbForPixel<Rgb>(l0.x, ch.x, ch.y).d,
        YuvToRgbForPixel<Rgb>(l0.y, ch.x, ch.y).d,
    };
    *(RgbIntx2*)(pDst + nRgbPitch) = RgbIntx2{
        YuvToRgbForPixel<Rgb>(l1.x, ch.x, ch.y).d,
        YuvToRgbForPixel<Rgb>(l1.y, ch.x, ch.y).d,
    };
}
Here are my basic assumptions, some of them are possibly wrong:
1. NV12 has two planes, one for luma and one for interleaved chroma.
2. The kernel tries to write 4 pixels at a time.
If assumption 2 is correct, the question is why the same chroma (ch) values are used for all 4 pixels? And if I am wrong on 2, please explain what exactly happens here.
The chroma planes in NV12 or NV21 are subsampled by a factor of 2.
For every 2x2 macro-pixel in the output there are 4 luma (Y) samples, 1 Cb and 1 Cr element.
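A minimal host-side sketch of that layout, just to make the indexing concrete. The function name, parameters, and the assumption that the interleaved CbCr plane starts right after the luma plane (at row height, with the same pitch) are illustrative, matching how the kernel above computes its chroma address:
#include <cstdint>

// NV12: luma plane (height rows), then one interleaved CbCr plane (height/2 rows).
// All four pixels of the 2x2 block containing (x, y) share one (Cb, Cr) pair.
void Nv12SamplesForPixel(const uint8_t* pYuv, int pitch, int height,
                         int x, int y,
                         uint8_t& Y, uint8_t& Cb, uint8_t& Cr)
{
    Y = pYuv[y * pitch + x];

    const uint8_t* pChroma = pYuv + height * pitch;   // start of the CbCr plane
    int chromaRow = y / 2;                             // subsampled by 2 vertically
    int chromaCol = (x / 2) * 2;                       // subsampled by 2 horizontally, interleaved
    Cb = pChroma[chromaRow * pitch + chromaCol];
    Cr = pChroma[chromaRow * pitch + chromaCol + 1];
}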

Error "expression must have integral or enum type" in thats code:

Error "expression must have integral or enum type" in thats code:
__global__ void VectorKernel(float *a, float *b, float *c, int n)
{
    int i = threadIdx.x;
    float y = 0, z = 0;

    if (i < n)
        y = (b - a) / n;

    for (float j = y; j <= n; j++) {
        z = (((j + y) - j) / 6) * function(j) + 4 * (function((j + (y + j)) / 2)) + function(y + j);
        c = c + z;
    }
}
The error happens at "z", in this line:
c = c + z;
(I'm a beginner in CUDA programming.)
c is a pointer. Pointer arithmetic requires a pointer and an integer-type expression.
If you want to add z to the float pointed to by c, you should change the expression to:
*c = *c + z;
When you write c = c + z and get an error like this, you should suspect your types are mismatched.
c is a float * and z is a float, so they cannot be added together, and the result could not be assigned back to c anyway.
What you probably want to do is store the result of *c + z in the memory location pointed at by c, in which case you'd write:
*c = *c + z;
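A minimal stand-alone sketch of the distinction, outside the kernel (the variable values here are made up for illustration):
float value = 1.0f;
float *c = &value;      // c holds an address, not a number
float z  = 2.5f;

// c = c + z;           // error: adding a float to a pointer is not valid
*c = *c + z;            // OK: reads the float c points to, adds z, writes it back
Note that inside the kernel, if several threads execute *c = *c + z on the same address at the same time, the updates race with each other; that is a separate issue from the compile error discussed here.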

Unsigned Binary Subtraction ( 0011 - 1111 = 0100 ?)

I'm currently building a 4-bit bitwise adder-subtractor for a project at university, but the subtraction doesn't all make sense to me. I have it working for positive results, 1100 - 0100 = 1000, but I'm not sure that my answer is correct when the result is negative. As can be seen in the title, when I do 0011 - 1111 as unsigned subtraction my result is 0100. Could someone verify for me whether this is the correct answer, and whether neither the overflow nor the carry should be 1 when this calculation is carried out?
Well, the short answer is no. Since you are working with unsigned binary, 0100 is equivalent to 4. What might work is if you took the modulus of the returned result. To do that you could do something like this:
public class Modul {
    public static void main(String[] args) {
        int x = 0b0011;
        int y = 0b1111;
        int z = x - y;
        int s1 = Integer.MAX_VALUE + 1;   // overflows to Integer.MIN_VALUE
        int s2 = s1 * 2;                  // overflows to 0
        int finalInt = s2 - z;
        System.out.println(x + "-" + y + "=" + z);
        System.out.println(Integer.toBinaryString(x) + "-" + Integer.toBinaryString(y) + "=" + Integer.toBinaryString(finalInt));
    }
}
This will return 1100, which is equivalent to 12.
I am unsure why you have tagged twos-complement, since you are using unsigned binary; however, if you want it with two's complement you can use this little program, which will print a 32-bit signed binary number:
public class Test {
    public static void main(String[] args) {
        int x = 0b0011;
        int y = 0b1111;
        int z = x - y;
        System.out.println(x + "-" + y + "=" + z);
        //3-15=-12
        System.out.println(Integer.toBinaryString(x) + "-" + Integer.toBinaryString(y) + "=" + Integer.toBinaryString(z));
        //0011-1111=11111111111111111111111111110100
    }
}
I believe the last four bits of this string are where you are getting 0100 from.
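A minimal C++ sketch of that last observation (the variable names are made up for illustration): keeping only the low 4 bits of the subtraction reproduces the 0100 that a 4-bit datapath produces, together with a borrow.
#include <bitset>
#include <iostream>

int main()
{
    unsigned a = 0b0011;                // 3
    unsigned b = 0b1111;                // 15
    unsigned diff = (a - b) & 0xFu;     // keep only the low 4 bits, like a 4-bit ALU
    bool borrow = b > a;                // the subtraction needed a borrow

    std::cout << std::bitset<4>(a) << " - " << std::bitset<4>(b)
              << " = " << std::bitset<4>(diff)
              << ", borrow = " << borrow << "\n";   // prints 0011 - 1111 = 0100, borrow = 1
    return 0;
}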

How to evaluate memory time and compute time for CUDA kernel?

I was working on an algorithm in CUDA and wanted to understand the performance of my kernel so I could optimize it appropriately.
I am required to determine whether my kernel is compute bound or memory bound using source code modifications only. The NVIDIA docs suggest running the kernel without memory accesses to determine the compute time, and similarly running the kernel without any computations to determine the memory time.
I do not know how to modify my source code appropriately to achieve this. How can you perform computations without memory access (or how can you compute a result without accessing the variables stored in memory)? Could you suggest an example for the memory case and the compute case in the following code, so I can work on modifying it completely myself...
__device__ inline float cndGPU(float d)
{
    const float A1 = 0.31938153f;
    const float A2 = -0.356563782f;
    const float A3 = 1.781477937f;
    const float A4 = -1.821255978f;
    const float A5 = 1.330274429f;
    const float RSQRT2PI = 0.39894228040143267793994605993438f;

    float K = 1.0f / (1.0f + 0.2316419f * fabsf(d));

    float cnd = RSQRT2PI * __expf(-0.5f * d * d) *
                (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5)))));

    if (d > 0)
        cnd = 1.0f - cnd;

    return cnd;
}

__device__ inline void BlackScholesBodyGPU(
    float &CallResult,
    float &PutResult,
    float S, //Stock price
    float X, //Option strike
    float T, //Option years
    float R, //Riskless rate
    float V  //Volatility rate
)
{
    float sqrtT, expRT;
    float d1, d2, CNDD1, CNDD2;

    sqrtT = sqrtf(T);
    d1 = (__logf(S / X) + (R + 0.5f * V * V) * T) / (V * sqrtT);
    d2 = d1 - V * sqrtT;

    CNDD1 = cndGPU(d1);
    CNDD2 = cndGPU(d2);

    //Calculate Call and Put simultaneously
    expRT = __expf(-R * T);
    CallResult = S * CNDD1 - X * expRT * CNDD2;
    PutResult  = X * expRT * (1.0f - CNDD2) - S * (1.0f - CNDD1);
}
Here is how I see it. If you have:
float cndGPU(float d) {
    const float a = 1;
    const float b = 2;
    float c;
    c = a + b + arr[d];
    return c;
}
Checking the compute time without memory access: literally write all your computing expressions into one, without using variables:
return 1 + 2 + 3; //just put some number that can be in arr[d]
Checking the memory access: literally the opposite:
const float a = 1;
const float b = 2;
float c;
c = arr[d]; //here we have our memory access
return c;
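Applying the same idea to the code in the question means writing two throwaway variants of the kernel that calls BlackScholesBodyGPU. The following is only a sketch under assumptions: the wrapper kernel and the array names (d_Call, d_Put, d_S, d_X, d_T) are invented here, since the question does not show the calling kernel, and the compute-only variant keeps a practically-never-taken store so the compiler cannot delete the arithmetic as dead code:
// Original pattern: load inputs, compute, store results.
__global__ void BlackScholesKernel(float *d_Call, float *d_Put,
                                   const float *d_S, const float *d_X, const float *d_T,
                                   float R, float V, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        BlackScholesBodyGPU(d_Call[i], d_Put[i], d_S[i], d_X[i], d_T[i], R, V);
}

// Compute-only variant: inputs derived from the thread index, so no global loads remain.
__global__ void ComputeOnlyKernel(float *d_Call, float *d_Put, float R, float V, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float call, put;
        BlackScholesBodyGPU(call, put, 30.0f + i * 1e-6f, 100.0f, 1.0f, R, V);
        if (call == -1234.5f)          // practically never true, but keeps the math alive
            d_Call[i] = call + put;
    }
}

// Memory-only variant: same loads and stores, arithmetic stripped out.
__global__ void MemoryOnlyKernel(float *d_Call, float *d_Put,
                                 const float *d_S, const float *d_X, const float *d_T, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float s = d_S[i], x = d_X[i], t = d_T[i];
        d_Call[i] = s;
        d_Put[i]  = x + t;
    }
}
Comparing the run times of the two stripped variants against the original then gives a rough indication of whether the loads/stores or the arithmetic dominate.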

CUDA kernel only works with 1D thread index

There is a weird problem. I have the following code. When I call the first function it does not give the correct result. However, when I call function2 (the second function) it works fine. It is so weird to me. Does anyone have any idea about the problem? Thanks!!!
__global__ void function(int w, class<double> C, float *result) {
    int r = threadIdx.x + blockIdx.x * blockDim.x;
    int c = threadIdx.y + blockIdx.y * blockDim.y;
    int half_w = w / 2;
    if (r < w && c < w) {
        double dis = sqrt((double)(r - half_w) * (r - half_w) + (double)(c - half_w) * (c - half_w));
        result[c * w + r] = (float)C.getVal(dis);
    }
}
__global__ void function2(int w, class<double> C, float *result) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int half_w = w / 2;
    int r = tid / w;
    int c = tid % w;
    if (r < w && c < w) {
        double dis = sqrt((double)(r - half_w) * (r - half_w) + (double)(c - half_w) * (c - half_w));
        result[c * w + r] = (float)C.getVal(dis);
    }
}
UPDATE:
I use function and function2 to draw an image. The pixel value is based on the distance between the image center and the current pixel position; based on that distance, the class C's getVal calculates the value for the pixel. So, in the kernel, I just make every thread calculate the distance and the corresponding pixel value. The correct result is compared with a CPU version. function just gives some random values, some very large, some very small. When I changed result[c * w + r] = (float)C.getVal(dis) to result[c * w + r] = 1.0f, the generated image did not seem to change.
The image size is W x W. To launch function I set
    dim3 grid_dim(w / 64 + 1, w / 64 + 1);
    dim3 block_dim(64, 64);
    function<<<grid_dim, block_dim>>>(W, C, cu_img);
To launch function2:
    function2<<<W / 128 + 1, 128>>>(W, C, cu_img);
Fixed:
I found the problem: I assigned too many threads to one block. The maximum number of threads per block is 1024 on my device, and a 64 x 64 block asks for 4096. When I ran cuda-memcheck, I could see that the kernel was never launched at all.
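For reference, a minimal sketch of a 2D launch that stays within the 1024-threads-per-block limit, reusing the W, C, and cu_img names from above (the 16 x 16 block size is just an assumption):
    dim3 block_dim(16, 16);                                  // 256 threads per block, well under 1024
    dim3 grid_dim((W + block_dim.x - 1) / block_dim.x,
                  (W + block_dim.y - 1) / block_dim.y);
    function<<<grid_dim, block_dim>>>(W, C, cu_img);

    // Checking the launch status would have reported the original failure immediately
    // (fragment; assumes <cstdio> is included).
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));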