I am trying to modify the original YUV->RGB kernel provided in the sample code of the NVIDIA Video SDK, and I need help understanding some of its parts.
Here is the kernel code:
template<class YuvUnitx2, class Rgb, class RgbIntx2>
__global__ static void YuvToRgbKernel(uint8_t* pYuv, int nYuvPitch, uint8_t* pRgb, int nRgbPitch, int nWidth, int nHeight) {
int x = (threadIdx.x + blockIdx.x * blockDim.x) * 2;
int y = (threadIdx.y + blockIdx.y * blockDim.y) * 2;
if (x + 1 >= nWidth || y + 1 >= nHeight) {
return;
}
uint8_t* pSrc = pYuv + x * sizeof(YuvUnitx2) / 2 + y * nYuvPitch;
uint8_t* pDst = pRgb + x * sizeof(Rgb) + y * nRgbPitch;
YuvUnitx2 l0 = *(YuvUnitx2*)pSrc;
YuvUnitx2 l1 = *(YuvUnitx2*)(pSrc + nYuvPitch);
YuvUnitx2 ch = *(YuvUnitx2*)(pSrc + (nHeight - y / 2) * nYuvPitch);
//YuvToRgbForPixel - returns rgba encoded in uint32_t (.d)
*(RgbIntx2*)pDst = RgbIntx2{
YuvToRgbForPixel<Rgb>(l0.x, ch.x, ch.y).d,
YuvToRgbForPixel<Rgb>(l0.y, ch.x, ch.y).d,
};
*(RgbIntx2*)(pDst + nRgbPitch) = RgbIntx2{
YuvToRgbForPixel<Rgb>(l1.x, ch.x, ch.y).d,
YuvToRgbForPixel<Rgb>(l1.y, ch.x, ch.y).d,
};
}
Here are my basic assumptions, some of which are possibly wrong:
1. NV12 has two planes: the first for luma and the second for interleaved chroma.
2. The kernel writes 4 pixels at a time.
If assumption 2 is correct, why are the same chroma (ch) values used for all 4 pixels? And if I am wrong about assumption 2, please explain what exactly happens here.
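The chroma plane in NV12 or NV21 is subsampled by a factor of 2 in both the horizontal and the vertical direction.
For every 2x2 block of output pixels there are 4 luma (Y) samples but only 1 Cb and 1 Cr sample, which is why the kernel reuses the same ch value for all 4 pixels.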
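The addressing can be illustrated outside CUDA. The following is a minimal Python sketch, not the SDK's code: it assumes a tightly packed 8-bit NV12 buffer (pitch equal to width, even dimensions), and the helper name `nv12_samples` is made up for this example. It shows that the four pixels of a 2x2 block read four different luma bytes but the same (Cb, Cr) pair.

```python
def nv12_samples(buf, width, height, x, y):
    """Return (Y, Cb, Cr) for pixel (x, y) of a tightly packed NV12 frame.

    Assumes pitch == width and even width/height; `buf` holds `height` rows
    of luma followed by `height // 2` rows of interleaved CbCr bytes.
    """
    luma = buf[y * width + x]
    # The chroma plane starts right after the luma plane. Each chroma row
    # covers two luma rows, and each CbCr byte pair covers two luma columns.
    chroma_row = height + y // 2
    chroma_col = (x // 2) * 2
    cb = buf[chroma_row * width + chroma_col]
    cr = buf[chroma_row * width + chroma_col + 1]
    return luma, cb, cr


# 4x2 frame: 8 luma bytes, then one chroma row holding two CbCr pairs.
frame = bytes([10, 11, 12, 13,
               20, 21, 22, 23,
               100, 200, 101, 201])
block = [nv12_samples(frame, 4, 2, x, y) for y in (0, 1) for x in (0, 1)]
```

Here `block` holds the top-left 2x2 pixels, and every entry carries the chroma pair (100, 200).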
I'm a beginner in C++ and I have a problem that I don't know how to solve.
I have an int function that computes its return value from several parameters:
int sphere(const float & X,const float & Y,const float & Z,
const int & Px, const int & Py, const int & Pz,
const int & diameterOfSphere, const int & number)
{
return pow(Px-X,2) + pow(Py+(diameterOfSphere * (number - 1))-Y,2)
+ pow(Pz-Z,2) <= pow(diameterOfSphere/2,2);
}
In this function, the integer "number" should run from 2 up to, for example, 100. If I choose 100 for "number", the return expression should be repeated 99 times, with the repetitions joined by a plus (+).
I can write this out manually, but that requires a lot of code, which is not practical.
For example, here it is written out manually for just a few values:
return (pow(Px-X,2)+pow((Py+(diameterOfSphere * 2))-Y,2)+pow(Pz-Z,2)
<= pow(diameterOfSphere/2,2))
+ (pow(Px-X,2)+pow((Py+(diameterOfSphere * 3))-Y,2)+pow(Pz-Z,2)
<= pow(diameterOfSphere/2,2))
+ (pow(Px-X,2)+pow((Py+(diameterOfSphere * 4))-Y,2)+pow(Pz-Z,2)
<= pow(diameterOfSphere/2,2))
+ (pow(Px-X,2)+pow((Py+(diameterOfSphere * 5))-Y,2)+pow(Pz-Z,2)
<= pow(diameterOfSphere/2,2)) ;
Is there an easier way? I know I have to use a loop, but I don't know how to do it in this case.
Thanks a lot
Don't use pow() to compute squares; pow() is a general exponentiation function and is quite slow. Break up your formula and format your lines to make the code readable. Your point's coordinates are integers; is that intentional? This variant is not only more readable, it is also more likely to be optimized by the compiler:
int sphere(const float & X,const float & Y, const float & Z,
const int & Px, const int & Py, const int & Pz,
const int & diameterOfSphere, const int & number)
{
const float dx = Px - X;
const float dy = Py + diameterOfSphere * (number - 1) - Y;
const float dz = Pz - Z;
const float D = dx*dx + dy*dy + dz*dz;
return D <= 0.25 * diameterOfSphere*diameterOfSphere;
}
Now, if I understood you correctly, you need recursion or a loop that emulates recursion. A function can actually call itself; did you know that?
int sphere(const float & X,const float & Y, const float & Z,
const int & Px, const int & Py, const int & Pz,
const int & diameterOfSphere, const int & number)
{
const float dx = Px - X;
const float dy = Py + diameterOfSphere * (number - 1) - Y;
const float dz = Pz - Z;
const float D = dx*dx + dy*dy + dz*dz;
if(!(number>0))
return 0;
return D <= 0.25 * diameterOfSphere*diameterOfSphere
+ sphere(X,Y,Z,Px,Py,Pz,diameterOfSphere, number -1);
}
The downsides of recursion: a) each function call fills the stack with stored variables and parameters; b) there is an extra call that returns immediately.
The expression Py + diameterOfSphere * (number - 1) - Y puzzles me; is that a mistake? It would almost never make the comparison true. It is also still not clear what you're trying to do with those comparisons. So, while I modified the code to match your idea, it looks chaotic or senseless. A >= or <= comparison returns 1 or 0 as its result. Or did you mean this?
return ( D <= 0.25 * diameterOfSphere*diameterOfSphere )
+ sphere(X,Y,Z,Px,Py,Pz,diameterOfSphere, number -1);
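The same sum can also be written as a plain loop. The following is a minimal Python sketch of that idea, assuming (as in the manual expansion in the question) that each comparison is evaluated on its own and the resulting 0/1 values are added; the name `count_spheres` and the parameter `last_number` are illustrative and not from the question.

```python
def count_spheres(x, y, z, px, py, pz, diameter, last_number):
    """Count the spheres, for number = 2..last_number, that contain (x, y, z).

    Sphere `number` is centred at (px, py + diameter * (number - 1), pz).
    """
    radius_sq = 0.25 * diameter * diameter
    hits = 0
    for number in range(2, last_number + 1):
        dx = px - x
        dy = py + diameter * (number - 1) - y
        dz = pz - z
        # Each comparison yields True/False, which counts as 1/0.
        hits += (dx * dx + dy * dy + dz * dz) <= radius_sq
    return hits
```

With `diameter = 10` and the stack starting at the origin, a point at y = 25 touches the spheres centred at y = 20 and y = 30, so `count_spheres(0, 25, 0, 0, 0, 0, 10, 100)` counts two hits.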
I know it's associative and commutative:
That is,
(~x1 + ~x2) + ~x3 = ~x1 + (~x2 + ~x3)
and
~x1 + ~x2 = ~x2 + ~x1
However, for the cases I tried, it doesn't seem to be distributive, i.e,
~x1 + ~x2 != ~(x1 + x2)
Is this true? Is there a proof?
I have C code as follows:
int n1 = 5;
int n2 = 3;
int result = ~n1 + ~n2 == ~(n1 + n2);
int calc = ~n1;
int calc0 = ~n2;
int calc1 = ~n1 + ~n2;
int calc2 = ~(n1 + n2);
printf("(Part B: n1 is %d, n2 is %d\n", n1, n2);
printf("Part B: (calc is: %d and calc0 is: %d\n", calc, calc0);
printf("Part B: (calc1 is: %d and calc2 is: %d\n", calc1, calc2);
printf("Part B: (~%d + ~%d) == ~(%d + %d) evaluates to %d\n", n1, n2, n1, n2, result);
Which gives the following output:
Part B: (n1 is 5, n2 is 3
Part B: (calc is: -6 and calc0 is: -4
Part B: (calc1 is: -10 and calc2 is: -9
Part B: (~5 + ~3) == ~(5 + 3) evaluates to 0
[I know this is really old, but I had the same question, and since the top answers were contradictory...]
The ones' complement is indeed distributive over addition. The problem in your code (and in Kaganar's kind but incorrect answer) is that you are dropping the carry bits. You can do that in two's complement, but not in ones' complement.
Whatever you use to store the sum needs more bits than the operands you are summing, so that you don't drop the carry bits. Then fold the carry bits back into the number of bits you are using to store your operands to get the proper sum. This is called an "end-around carry" in ones' complement arithmetic.
From the Wikipedia article ( https://en.wikipedia.org/wiki/Signed_number_representations#Ones'_complement ):
To add two numbers represented in this system, one does a conventional binary addition, but it is then necessary to do an end-around carry: that is, add any resulting carry back into the resulting sum. To see why this is necessary, consider the following example showing the case of the addition of −1 (11111110) to +2 (00000010):
     binary    decimal
   11111110      −1
+  00000010      +2
───────────      ──
 1 00000000       0   ← Not the correct answer
          1      +1   ← Add carry
───────────      ──
   00000001       1   ← Correct answer
In the previous example, the first binary addition gives 00000000, which is incorrect. The correct result (00000001) only appears when the carry is added back in.
I changed your code a bit to make it easier to do the math myself for a sanity check and tested. It may require a bit more thought using signed integer datatypes or to account for end-around borrowing instead of carrying. I didn't go that far since my application is all about checksums (i.e. unsigned, and addition only).
unsigned short n1 = 5; //using 16-bit unsigned integers
unsigned short n2 = 3;
unsigned long oc_added = (unsigned short)~n1 + (unsigned short)~n2; //32bit
/* Fold the carry bits into 16 bits */
while (oc_added >> 16)
oc_added = (oc_added & 0xffff) + (oc_added >> 16);
unsigned long oc_sum = ~(n1 + n2); //oc_sum has 32 bits (room for carry)
/* Fold the carry bits into 16 bits */
while (oc_sum >> 16)
oc_sum = (oc_sum & 0xffff) + (oc_sum >> 16);
int result = oc_added == oc_sum;
unsigned short calc = ~n1;
unsigned short calc0 = ~n2;
unsigned short calc1 = ~n1 + ~n2; //loses a carry bit
unsigned short calc2 = ~(n1 + n2);
printf("(Part B: n1 is %d, n2 is %d\n", n1, n2);
printf("Part B: (calc is: %d and calc0 is: %d\n", calc, calc0);
printf("Part B: (calc1 is: %d and calc2 is: %d\n", calc1, calc2);
printf("Part B: (~%d + ~%d) == ~(%d + %d) evaluates to %d\n", n1, n2, n1, n2, result);
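The end-around carry can also be checked exhaustively for a small width. The following is a minimal Python sketch under stated assumptions: 8-bit patterns by default, and helper names (`oc_add`, `oc_not`, `same_value`) invented for this example. Because ones' complement has two encodings of zero, the comparison treats the all-ones pattern and the all-zeros pattern as the same value.

```python
def oc_add(a, b, bits=8):
    """Ones' complement addition of two `bits`-wide patterns with end-around carry."""
    mask = (1 << bits) - 1
    total = a + b
    while total >> bits:           # fold any carry back into the low bits
        total = (total & mask) + (total >> bits)
    return total


def oc_not(a, bits=8):
    """Bitwise NOT restricted to `bits` bits."""
    return ~a & ((1 << bits) - 1)


def same_value(a, b, bits=8):
    """Compare two patterns, treating all-ones (negative zero) as zero."""
    mask = (1 << bits) - 1
    return (a % mask) == (b % mask)


# ~a + ~b versus ~(a + b) for every pair of 8-bit patterns.
distributes = all(
    same_value(oc_add(oc_not(a), oc_not(b)), oc_not(oc_add(a, b)))
    for a in range(256)
    for b in range(256)
)
```

`oc_add(0b11111110, 0b00000010)` reproduces the Wikipedia example (−1 plus +2 gives 1), and `distributes` is true for all 8-bit pairs under this comparison.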
Check out the Wikipedia article on ones' complement. Addition in ones' complement uses an end-around carry, where you must add the overflow bit back to the lowest bit.
Since ~ (NOT) is equivalent to - (NEGATE) in ones' complement, we can re-write it as:
-x1 + -x2 = -(x1 + x2)
which is correct.
Ones' complement is used to represent negative and positive numbers in fixed-width registers. To distribute over addition, the following must hold: ~a + ~b = ~(a + b). The OP states that + represents adding 'two binary numbers'. This is itself vague; however, if we take it to mean adding unsigned binary numbers, then no, ones' complement is not distributive.
Note that there are two zeroes in ones' complement: all bits are ones or all bits are zeroes.
Check to see that ~0 + ~0 != ~(0 + 0):
~0 is zero. However, ~0 is represented by all ones. Adding it to itself doubles it, which is the same as a left shift, and thus introduces a zero in the right-hand digit. But that is no longer one of the two zeroes.
However, 0 is zero, and 0 + 0 is also zero, and thus so is ~(0 + 0). But the left side isn't zero, so distribution does not hold.
On the other hand... Consider two's complement: flip all bits and add one. If care is taken to treat negatives in ones' complement specially, then that version of 'binary addition' is similar to two's complement and is distributive, as you end up with a familiar quotient ring just like in two's complement.
The aforementioned Wikipedia article has more details on handling addition to allow for expected arithmetic behavior.
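The ~0 + ~0 case can be checked numerically. This is a minimal Python sketch assuming 8-bit patterns; the variable names are illustrative only. It computes the left-hand side twice, once with plain unsigned addition that drops the carry and once with the carry added back in, and compares both against ~(0 + 0).

```python
BITS = 8
MASK = (1 << BITS) - 1

neg_zero = ~0 & MASK                      # 0b11111111, the "negative zero"

# Plain unsigned addition that simply drops the carry out of the top bit.
truncated = (neg_zero + neg_zero) & MASK  # 0b11111110

# The same addition with the carry added back in (end-around carry).
raw = neg_zero + neg_zero
wrapped = (raw & MASK) + (raw >> BITS)    # 0b11111111

rhs = ~(0 + 0) & MASK                     # 0b11111111
```

With the carry dropped, `truncated` differs from `rhs`, which is the failure described above; with the carry folded back in, `wrapped` equals `rhs`.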
From De Morgan's laws:
~(x1 + x2) = ~x1 * ~x2
It's my first post here, so if I make a mistake please let me know.
I have been given an assignment, and part of it requires the binary representation of the n'th Fibonacci number.
Constraints:
1) C++ has to be used as the programming language.
2) The n'th Fibonacci number has to be calculated in lg(n) time.
I have a function, but it works on built-in integers, and the maximum n for which I have to do the calculation is about 10^6. So I am badly stuck here.
What I know doesn't apply in this scenario: I can generate the n'th Fibonacci number using strings, but that has linear time complexity.
The following is the function:
void multiply(long int F[2][2], long int M[2][2]);
void power(long int F[2][2], long int n);
// Function to Calculate n'th fibonacci in log(n) time
long int fib(long int n)
{
long int F[2][2] = {{1,1},{1,0}};
if(n == 0)
return 0;
power(F, n-1);
return F[0][0];
}
void power(long int F[2][2], long int n)
{
if( n == 0 || n == 1)
return;
long int M[2][2] = {{1,1},{1,0}};
power(F, n/2);
multiply(F, F);
if( n%2 != 0 )
multiply(F, M);
}
void multiply(long int F[2][2], long int M[2][2])
{
long int x = (F[0][0]*M[0][0])%mod + (F[0][1]*M[1][0])%mod;
long int y = (F[0][0]*M[0][1])%mod + (F[0][1]*M[1][1])%mod;
long int z = (F[1][0]*M[0][0])%mod + (F[1][1]*M[1][0])%mod;
long int w = (F[1][0]*M[0][1])%mod + (F[1][1]*M[1][1])%mod;
F[0][0] = x;
F[0][1] = y;
F[1][0] = z;
F[1][1] = w;
}
int main(){
int n; cin >> n; cout << fib(n)<<endl; getchar();
}
As it can be seen, only predefined data types can be used in this function.
Since this is homework, I'll only give you little hints.
The two problems are unrelated, so you need two methods: toBinary and fib. toBinary (fib (n)); would be your solution.
For solving the toBinary part, division and modulo are useful and can be called recursively.
If you calculate fib (n) as fib (n-1) + fib (n-2), there is a trap: when you calculate fib (n-1) as fib (n-2) + fib (n-3), you end up calculating fib (n-2) twice, fib (n-3) three times and so on.
Instead, you should start from (0 + 1) and step upwards, passing the already calculated values forward.
After a short test, I see how fast the Fibonacci numbers are growing. Do you have access to Ints of arbitrary size, or are you expected to use preallocated arrays?
Then you would need an add method, which takes the lower and the higher number as array of Integers or Booleans, and creates the sum in the lower array, which then becomes the upper array.
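The two hints can be sketched as follows. This is a minimal Python illustration, not a solution in the required language: Python integers are arbitrary-precision, so no array-based add method is needed here, and the recursive `to_binary` only suits small values because its recursion depth equals the number of binary digits. The function names are chosen for this example.

```python
def to_binary(value):
    """Binary digits of a non-negative integer via repeated division by 2."""
    if value < 2:
        return str(value)
    quotient, remainder = divmod(value, 2)
    return to_binary(quotient) + str(remainder)


def fib(n):
    """n-th Fibonacci number, stepping upwards from (0, 1)."""
    low, high = 0, 1
    for _ in range(n):
        # Pass the already calculated pair forward instead of recomputing it.
        low, high = high, low + high
    return low
```

For example, `to_binary(fib(10))` gives the binary digits of 55.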
update:
Since you solved the problem, I feel free to post my solution for reference, written in Scala:
import annotation._
/**
add two arrays recursively. carry the position pos and the overrun
overrun=0 = 1 0 1 0 1
Sum low | next | Sum
0 1 | overrun | %2
high 0| 0 1 1 2 | 0 0 0 1 | 0 1 1 0
1| 1 2 2 3 | 0 1 1 1 | 1 0 0 1
*/
@tailrec
def add (low: Array[Int], high: Array[Int], pos: Int = 0, overrun: Int = 0): Array[Int] = {
if (pos == high.size) {
if (overrun == 0) low else sys.error ("overrun!")
} else {
val sum = low (pos) + high (pos) + overrun
low (pos) = (sum % 2)
add (low, high, pos + 1, if (sum > 1) 1 else 0)
}
}
/** call cnt (example: 5) steps of
fib (5, 0, 1),
fib (4, 1, 1),
fib (3, 1, 2),
fib (2, 2, 3),
fib (1, 3, 5),
fib (0, 5, 8) */
@tailrec
def fib (cnt: Int, low: Array[Int], high: Array[Int]): Array[Int] = {
if (cnt == 0) low else fib (cnt - 1, high, add (low, high)) }
/** generate 2 Arrays, size dependent on n of about 0.7*n + 1, big enough to
hold values and result. Result has to be printed in reverse order, from the highest bit
*/
def fibonacci (n: Int) = {
val lower = Array.fill (n * 7 / 10 + 1)(0) // [...000]
val higher = Array.fill (n * 7 / 10 + 1)(0) // [...000]
higher (0) = 1 // [...001]
val res = fib (n, lower, higher)
res.reverse.foreach (print)
println ()
res
}
fibonacci (n)
For fibonacci (10000) I get a result of nearly 7000 binary digits, and the ratio of about 7 binary digits per 10 steps of n is constant, so the millionth Fibonacci number will have about 0.7 M binary digits.
A better method is to use matrix exponentiation, which calculates the n'th Fibonacci number in lg(n) time (useful for various online coding contests). See Method 4 of this post.
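A minimal Python sketch of matrix exponentiation by repeated squaring follows. It is an illustration under stated assumptions rather than the linked post's code: the names `mat_mult` and `fib_matrix` are made up here, and Python's arbitrary-precision integers replace the `mod` reduction. The loop performs O(lg n) matrix multiplications, but each multiplication on very large integers is not a constant-time operation.

```python
def mat_mult(a, b):
    """Multiply two 2x2 matrices given as ((a00, a01), (a10, a11))."""
    return (
        (a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]),
        (a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]),
    )


def fib_matrix(n):
    """n-th Fibonacci number via repeated squaring of [[1, 1], [1, 0]]."""
    result = ((1, 0), (0, 1))          # identity matrix
    base = ((1, 1), (1, 0))
    while n > 0:
        if n & 1:
            result = mat_mult(result, base)
        base = mat_mult(base, base)
        n >>= 1
    return result[0][1]                # M^n = [[F(n+1), F(n)], [F(n), F(n-1)]]


fib_bits = bin(fib_matrix(10))[2:]     # binary digits without the "0b" prefix
```

The binary representation then comes directly from `bin()`, as in `fib_bits` above.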
If I have an integer n, how can I find the next number k > n such that k = 2^i, for some i in N, using bitwise shifting or logic?
Example: If I have n = 123, how can I find k = 128, which is a power of two, and not 124 which is only divisible by two. This should be simple, but it eludes me.
For 32-bit integers, this is a simple and straightforward route:
unsigned int n;
n--;
n |= n >> 1; // Divide by 2^k for consecutive doublings of k up to 32,
n |= n >> 2; // and then or the results.
n |= n >> 4;
n |= n >> 8;
n |= n >> 16;
n++; // The result is a number of 1 bits equal to the number
// of bits in the original number, plus 1. That's the
// next highest power of 2.
Here's a more concrete example. Let's take the number 221, which is 11011101 in binary:
n--; // 1101 1101 --> 1101 1100
n |= n >> 1; // 1101 1100 | 0110 1110 = 1111 1110
n |= n >> 2; // 1111 1110 | 0011 1111 = 1111 1111
n |= n >> 4; // ...
n |= n >> 8;
n |= n >> 16; // 1111 1111 | 1111 1111 = 1111 1111
n++; // 1111 1111 --> 1 0000 0000
There's one bit in the ninth position, which represents 2^8, or 256, which is indeed the next largest power of 2. Each of the shifts overlaps all of the existing 1 bits in the number with some of the previously untouched zeroes, eventually producing a number of 1 bits equal to the number of bits in the original number. Adding one to that value produces a new power of 2.
Another example; we'll use 131, which is 10000011 in binary:
n--; // 1000 0011 --> 1000 0010
n |= n >> 1; // 1000 0010 | 0100 0001 = 1100 0011
n |= n >> 2; // 1100 0011 | 0011 0000 = 1111 0011
n |= n >> 4; // 1111 0011 | 0000 1111 = 1111 1111
n |= n >> 8; // ... (At this point all bits are 1, so further bitwise-or
n |= n >> 16; // operations produce no effect.)
n++; // 1111 1111 --> 1 0000 0000
And indeed, 256 is the next highest power of 2 from 131.
If the number of bits used to represent the integer is itself a power of 2, you can continue to extend this technique efficiently and indefinitely (for example, add a n >> 32 line for 64-bit integers).
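The extension to other widths can be sketched with the shift sequence generated in a loop. This is a minimal Python illustration; the function name `next_pow2_smear` and the `bits` parameter are assumptions for this example. Like the snippet above, it decrements first, so it returns n itself when n is already a power of two.

```python
def next_pow2_smear(n, bits=32):
    """Smallest power of two >= n, for 1 <= n <= 2**(bits - 1).

    `bits` is the integer width and is assumed to be a power of two.
    """
    n -= 1
    shift = 1
    while shift < bits:        # shifts 1, 2, 4, ... up to bits // 2
        n |= n >> shift        # smear the highest set bit to the right
        shift <<= 1
    return n + 1
```

Calling `next_pow2_smear(221)` and `next_pow2_smear(131)` reproduces the two worked examples, and passing `bits=64` adds the extra `n >> 32` step.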
There is actually a assembly solution for this (since the 80386 instruction set).
You can use the BSR (Bit Scan Reverse) instruction to scan for the most significant bit in your integer.
bsr scans the bits, starting at the
most significant bit, in the
doubleword operand or the second word.
If the bits are all zero, ZF is
cleared. Otherwise, ZF is set and the
bit index of the first set bit found,
while scanning in the reverse
direction, is loaded into the
destination register
(Extracted from: http://dlc.sun.com/pdf/802-1948/802-1948.pdf)
And then increment the result by 1.
so:
bsr ecx, eax //eax = number
jz @zero
mov eax, 2 // result set the second bit (instead of a inc ecx)
shl eax, ecx // and move it ecx times to the left
ret // result is in eax
@zero:
xor eax, eax
ret
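In newer CPUs you can use the much faster lzcnt instruction (aka rep bsr). lzcnt does its job in a single cycle. Note that lzcnt returns the number of leading zero bits rather than the index of the highest set bit, so the shift amount has to be adjusted accordingly.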
A more mathematical way, without loops:
public static int ByLogs(int n)
{
double y = Math.Floor(Math.Log(n, 2));
return (int)Math.Pow(2, y + 1);
}
Here's a logic answer:
function getK(int n)
{
int k = 1;
while (k < n)
k *= 2;
return k;
}
Here's John Feminella's answer implemented as a loop so it can handle Python's long integers:
def next_power_of_2(n):
"""
Return next power of 2 greater than or equal to n
"""
n -= 1 # greater than OR EQUAL TO n
shift = 1
while (n+1) & n: # n+1 is not a power of 2 yet
n |= n >> shift
shift <<= 1
return n + 1
It also returns faster if n is already a power of 2.
For Python 2.7 and later, this is simpler and faster for most N:
def next_power_of_2(n):
"""
Return next power of 2 greater than or equal to n
"""
return 2**(n-1).bit_length()
This answer is based on constexpr to avoid any computation at runtime when the function argument is a constant expression.
Greater than / Greater than or equal to
The following snippets are for the next number k > n such that k = 2^i
(n=123 => k=128, n=128 => k=256) as specified by OP.
If you want the smallest power of 2 greater than OR equal to n then just replace __builtin_clzll(n) by __builtin_clzll(n-1) in the following snippets.
C++11 using GCC or Clang (64 bits)
#include <cstdint> // uint64_t
constexpr uint64_t nextPowerOfTwo64 (uint64_t n)
{
return 1ULL << (sizeof(uint64_t) * 8 - __builtin_clzll(n));
}
Enhancement using CHAR_BIT as proposed by martinec
#include <climits> // CHAR_BIT
#include <cstdint> // uint64_t
constexpr uint64_t nextPowerOfTwo64 (uint64_t n)
{
return 1ULL << (sizeof(uint64_t) * CHAR_BIT - __builtin_clzll(n));
}
C++17 using GCC or Clang (from 8 to 128 bits)
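#include <climits> // CHAR_BIT
#include <cstdint> // uint64_t
#include <immintrin.h> // _lzcnt_u64
template <typename T>
constexpr T nextPowerOfTwo64 (T n)
{
constexpr int bits = CHAR_BIT * sizeof(T);
constexpr int intBits = CHAR_BIT * sizeof(unsigned int);
constexpr int llBits = CHAR_BIT * sizeof(unsigned long long);
int clz = 0;
if constexpr (bits <= intBits) // the builtin counts in an unsigned int, so drop the extra width
clz = __builtin_clz(n) - (intBits - bits);
else if constexpr (bits <= llBits) // same adjustment for unsigned long long
clz = __builtin_clzll(n) - (llBits - bits);
else { // See https://stackoverflow.com/a/40528716
uint64_t hi = n >> 64;
uint64_t lo = (hi == 0) ? n : -1ULL;
clz = _lzcnt_u64(hi) + _lzcnt_u64(lo);
}
return T{1} << (bits - clz);
}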
Other compilers
If you use a compiler other than GCC or Clang, please visit the Wikipedia page listing the Count Leading Zeroes bitwise functions:
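Visual C++ 2005 => Replace __builtin_clzl() by _BitScanReverse() (it returns the index of the highest set bit, not the number of leading zeros)
Visual C++ 2008 => Replace __builtin_clzl() by __lzcnt()
icc => Replace __builtin_clzl() by _bit_scan_reverse (also returns the index of the highest set bit)
GHC (Haskell) => Replace __builtin_clzl() by countLeadingZeros()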
Contribution welcome
Please propose improvements within the comments. Also propose alternative for the compiler you use, or your programming language...
See also similar answers
nulleight's answer
ydroneaud's answer
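The difference between the "greater than" and the "greater than or equal to" variants can be sketched in Python, where `int.bit_length()` plays the role of "width minus count of leading zeros". This is a minimal illustration; the function names are chosen for this example and are not part of the answer above.

```python
def next_pow2_gt(n):
    """Smallest power of two strictly greater than n (n >= 0)."""
    return 1 << n.bit_length()


def next_pow2_ge(n):
    """Smallest power of two greater than or equal to n (n >= 1)."""
    return 1 << (n - 1).bit_length()
```

The two functions agree for n = 123 and differ only when n is itself a power of two, for example n = 128.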
Here's a wild one that has no loops, but uses an intermediate float.
// compute k = nextpowerof2(n)
if (n > 1)
{
float f = (float) n;
unsigned int const t = 1U << ((*(unsigned int *)&f >> 23) - 0x7f);
k = t << (t < n);
}
else k = 1;
This, and many other bit-twiddling hacks, including the one submitted by John Feminella, can be found here.
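The float trick reads the power of two straight out of the IEEE 754 exponent field. The following is a minimal Python sketch of the same idea, using `struct` to reinterpret the bits of a single-precision float; the function name `next_pow2_float` is made up for this example, and the input is assumed to lie in 1..2**24 so that the conversion to float32 is exact.

```python
import struct


def next_pow2_float(n):
    """Power of two >= n for 1 <= n <= 2**24, read from a float32 exponent."""
    if n <= 1:
        return 1
    # Reinterpret the float32 encoding of n as a 32-bit unsigned integer.
    raw = struct.unpack("<I", struct.pack("<f", float(n)))[0]
    exponent = (raw >> 23) - 0x7F     # remove the IEEE 754 exponent bias
    t = 1 << exponent                 # largest power of two <= n
    return t << (t < n)               # double it unless n is already a power of two
```

As in the C snippet, the result equals n when n is already a power of two.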
assume x is not negative.
int pot = Integer.highestOneBit(x);
if (pot != x) {
pot *= 2;
}
If you use GCC, MinGW or Clang:
template <typename T>
T nextPow2(T in)
{
return (in & (T)(in - 1)) ? (1U << (sizeof(T) * 8 - __builtin_clz(in))) : in;
}
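If you use Microsoft Visual C++, use the function _BitScanReverse() in place of __builtin_clz(); note that it returns the index of the highest set bit rather than the number of leading zeros.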
function Pow2Thing(int n)
{
x = 1;
while (n>0)
{
n/=2;
x*=2;
}
return x;
}
Bit-twiddling, you say?
long int pow_2_ceil(long int t) {
if (t == 0) return 1;
if (t != (t & -t)) {
do {
t -= t & -t;
} while (t != (t & -t));
t <<= 1;
}
return t;
}
Each loop strips the least-significant 1-bit directly. N.B. This only works where signed numbers are encoded in two's complement.
What about something like this:
int pot = 1;
for (int i = 0; i < 31; i++, pot <<= 1)
if (pot >= x)
break;
You just need to find the most significant bit and shift it left once. Here's a Python implementation. I think x86 has an instruction to get the MSB, but here I'm implementing it all in straight Python. Once you have the MSB it's easy.
>>> def msb(n):
... result = -1
... index = 0
... while n:
... bit = 1 << index
... if bit & n:
... result = index
... n &= ~bit
... index += 1
... return result
...
>>> def next_pow(n):
... return 1 << (msb(n) + 1)
...
>>> next_pow(1)
2
>>> next_pow(2)
4
>>> next_pow(3)
4
>>> next_pow(4)
8
>>> next_pow(123)
128
>>> next_pow(222)
256
>>>
Forget this! It uses a loop!
unsigned int nextPowerOf2 ( unsigned int u)
{
unsigned int v = 0x80000000; // supposed 32-bit unsigned int
if (u < v) {
while (v > u) v = v >> 1;
}
return (v << 1); // return 0 if number is too big
}
private static int nextHighestPower(int number){
if((number & number-1)==0){
return number;
}
else{
int count=0;
while(number!=0){
number=number>>1;
count++;
}
return 1<<count;
}
}
// n is the number
int min = (n&-n);
int nextPowerOfTwo = n+min;
#define nextPowerOf2(x, n) (((x) + ((n) - 1)) & ~((n) - 1))
or even
#define nextPowerOf2(x, n) ((x) + ((x) & ((n) - 1)))