I have a 64-bit number (but only the 42 low-order bits are used) and need to compute the sum of the 4 bits at n, n+m, n+m*2 and n+m*3 (note: anything that can produce a sum >4 is invalid) for some fixed m and every value of n that places all the bits in the number.
As an example, using m=3 and given the 16-bit number
0010 1011 0110 0001
I need to compute
2, 3, 1, 2, 3, 0, 3
Does anyone have any (cool) ideas for ways to do this? I'm fine with bit twiddling.
My current thought is to make bit-shifted copies of the input to align the values to be summed, and then build a logic tree implementing a 4x 1-bit adder.
v1 = In;
v2 = In<<3;
v3 = In<<6;
v4 = In<<9;
a1 = v1 ^ v2;
a2 = v1 & v2;
b1 = v3 ^ v4;
b2 = v3 & v4;
c2 = a1 & b1;
d2 = a2 ^ b2;
o1 = a1 ^ b1;
o2 = c2 ^ d2;
o4 = a2 & b2;
This does end up with the bits of the result spread across 3 different ints, but oh well.
Edit: as it happens I need the histogram of the sums, so doing a bit-count of o4, o2&o1, o2 and o1 gives me what I want.
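A quick way to gain confidence in the adder tree is to check it against a naive sum. The sketch below is mine, not the poster's: it uses right shifts instead of left shifts so the sum for position n lands at bit n of each output plane, with m = 3 as in the example.

```cpp
#include <cassert>
#include <cstdint>

// Bit-planes of the per-position sums: sum(n) = o1 + 2*o2 + 4*o4 at bit n.
struct Planes { uint32_t o1, o2, o4; };

Planes addTree3(uint32_t in) {
    uint32_t v1 = in, v2 = in >> 3, v3 = in >> 6, v4 = in >> 9;
    uint32_t a1 = v1 ^ v2, a2 = v1 & v2;   // half-add v1 + v2
    uint32_t b1 = v3 ^ v4, b2 = v3 & v4;   // half-add v3 + v4
    uint32_t c2 = a1 & b1, d2 = a2 ^ b2;   // carry from ones plane
    return { a1 ^ b1, c2 ^ d2, a2 & b2 };  // ones, twos, fours planes
}

// Naive reference: sum of bits n, n+3, n+6, n+9.
int sumAt(uint32_t in, int n) {
    return (in >> n & 1) + (in >> (n + 3) & 1)
         + (in >> (n + 6) & 1) + (in >> (n + 9) & 1);
}
```

The carry logic never loses a bit because c2 = 1 forces a2 = b2 = 0 (a lane pair summing to 2 has a zero ones-bit), so the twos-plane addition cannot itself overflow past o4.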
a second solution uses a perfect hash function
arr = [0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4];
for(int i = 0; i < N; i++)
{
out[i] = arr[(In & 0b1001001001) % 30];
In >>= 1;
}
This works by noting that the 4 selected bits can only take on 16 patterns and that (by guess and check) they can be hashed into 0-15 using mod 30. From there, a table of computed values gives the needed sum. As it happens only 3 of the 4 strides I need work this way.
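The injectivity claim is easy to machine-check. This sketch (my code, hypothetical names) enumerates all 16 patterns of bits 0, 3, 6 and 9 and verifies that their values mod 30 are 16 distinct numbers in 0..15, each mapping through the table to the right bit count.

```cpp
#include <cassert>
#include <cstdint>

// The mask 0b1001001001 (= 0x249) selects bits 0, 3, 6, 9, whose weights
// 1, 8, 64, 512 reduce mod 30 to the distinct powers 1, 8, 4, 2; the
// residue is therefore a permutation of a 4-bit pattern, so the bit count
// is preserved and the table lookup recovers it.
int hashSum(uint32_t in) {
    static const int arr[16] = {0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4};
    return arr[(in & 0x249u) % 30];
}
```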
p.s.
Correct trumps fast. Fast trumps clear. I expect to be running this millions of times.
Maybe I am crazy, but I am having fun :D
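In the spirit of "correct trumps fast", here is a naive reference implementation of the requested operation (my sketch, not from any answer), against which the cleverer versions can be checked. Note that with least-significant-bit-first indexing, the example sequence comes out reversed.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sum the bits at n, n+m, n+2m, n+3m for every n that keeps all four
// positions inside the low `bits` bits of `in`.
std::vector<int> stridedSums(uint64_t in, int m, int bits) {
    std::vector<int> out;
    for (int n = 0; n + 3 * m < bits; ++n)
        out.push_back((int)((in >> n & 1) + (in >> (n + m) & 1) +
                            (in >> (n + 2 * m) & 1) + (in >> (n + 3 * m) & 1)));
    return out;
}
```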
This solution is based on data parallelism: faking a vector CPU without actually using SSE intrinsics or anything similar.
unsigned short out[64];
const unsigned long long mask = 0x0249024902490249ul;
const unsigned long long shiftmask = 0x0001000100010001ul;
unsigned long long t = (unsigned short)(in >> 38) | (unsigned long long)(unsigned short)(in >> 39) << 16 | (unsigned long long)(unsigned short)(in >> 40) << 32 | (unsigned long long)(unsigned short)(in >> 41) << 48;
t &= mask;
*((unsigned long long*)(out + 38)) = (t & shiftmask) + (t >> 3 & shiftmask) + (t >> 6 & shiftmask) + (t >> 9 & shiftmask);
[... snipsnap ...]
t = (unsigned short)(in >> 2) | (unsigned long long)(unsigned short)(in >> 3) << 16 | (unsigned long long)(unsigned short)(in >> 4) << 32 | (unsigned long long)(unsigned short)(in >> 5) << 48;
t &= mask;
*((unsigned long long*)(out + 2)) = (t & shiftmask) + (t >> 3 & shiftmask) + (t >> 6 & shiftmask) + (t >> 9 & shiftmask);
t = (unsigned short)in | (unsigned long long)(unsigned short)(in >> 1) << 16;
t &= mask;
*((unsigned int*)out) = (unsigned int)((t & shiftmask) + (t >> 3 & shiftmask) + (t >> 6 & shiftmask) + (t >> 9 & shiftmask));
By reordering the computations, we can reduce the execution time significantly, since it drastically cuts the number of times values have to be packed into the QWORD. A few other optimizations are quite obvious and rather minor, but they add up to another interesting speedup.
unsigned short out[64];
const unsigned long long Xmask = 0x249024902490249ull;
const unsigned long long Ymask = 0x7000700070007u;
unsigned long long x = (in >> 14 & 0xFFFFu) | (in >> 20 & 0xFFFFu) << 16 | (in >> 26 & 0xFFFFu) << 32 | (in >> 32) << 48;
unsigned long long y;
y = x & Xmask;
y += y >> 6;
y += y >> 3;
y &= Ymask;
out[32] = (unsigned short)(y >> 48);
out[26] = (unsigned short)(y >> 32);
out[20] = (unsigned short)(y >> 16);
out[14] = (unsigned short)(y );
x >>= 1;
y = x & Xmask;
y += y >> 6;
y += y >> 3;
y &= Ymask;
out[33] = (unsigned short)(y >> 48);
out[27] = (unsigned short)(y >> 32);
out[21] = (unsigned short)(y >> 16);
out[15] = (unsigned short)(y );
[snisnap]
x >>= 1;
y = x & Xmask;
y += y >> 6;
y += y >> 3;
y &= Ymask;
out[37] = (unsigned short)(y >> 48);
out[31] = (unsigned short)(y >> 32);
out[25] = (unsigned short)(y >> 16);
out[19] = (unsigned short)(y );
x >>= 1;
x &= 0xFFFF000000000000ul;
x |= (in & 0xFFFFu) | (in >> 5 & 0xFFFFu) << 16 | (in >> 10 & 0xFFFFu) << 32;
y = x & Xmask;
y += y >> 6;
y += y >> 3;
y &= Ymask;
out[38] = (unsigned short)(y >> 48);
out[10] = (unsigned short)(y >> 32);
out[ 5] = (unsigned short)(y >> 16);
out[ 0] = (unsigned short)(y );
[snipsnap]
x >>= 1;
y = x & Xmask;
y += y >> 6;
y += y >> 3;
y &= Ymask;
out[ 9] = (unsigned short)(y >> 16);
out[ 4] = (unsigned short)(y );
Running times for 50 million executions in native C++ (all outputs verified to match ^^), compiled as a 64-bit binary on my PC:
Array based solution: ~5700 ms
Naive hardcoded solution: ~4200 ms
The first solution: ~2400 ms
The second solution: ~1600 ms
A suggestion that I don't want to code right now is to use a loop, an array to hold partial results, and constants to pick up the bits m at a time.
loop
s[3*i] += x & (1 << 0);
s[3*i+1] += x & (1 << 1);
s[3*i+2] += x & (1 << 2);
x >>= 3;
This will pick up too many bits in each sum. But you can also keep track of the intermediate results and subtract from the sums as you go, to account for the bit that may not be there anymore.
loop
s[3*i] += p[3*i] = x & (1 << 0);
s[3*i+1] += p[3*i+1] = x & (1 << 1);
s[3*i+2] += p[3*i+2] = x & (1 << 2);
s[3*i] -= p[3*i-10];
s[3*i+1] -= p[3*i-9];
s[3*i+2] -= p[3*i-8];
x >>= 3;
with the appropriate bounds checking, of course.
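One way to flesh out the add-and-subtract idea (my sketch, not the poster's code): sums whose start positions differ by m form a chain, so each new sum is the previous one in its chain, minus the bit that left the window, plus the bit that entered it.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sliding-window form: s[n] = s[n - m] - bit(n - m) + bit(n + 3m).
std::vector<int> slidingSums(uint64_t in, int m, int bits) {
    int count = bits - 3 * m;               // number of valid start positions
    if (count < 0) count = 0;
    std::vector<int> s(count);
    auto bit = [&](int i) { return (int)(in >> i & 1); };
    for (int n = 0; n < count; ++n) {
        if (n < m)                          // first sum in its chain: compute in full
            s[n] = bit(n) + bit(n + m) + bit(n + 2 * m) + bit(n + 3 * m);
        else                                // slide within the chain
            s[n] = s[n - m] - bit(n - m) + bit(n + 3 * m);
    }
    return s;
}
```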
The fastest approach is to just hardcode the sums themselves.
s[0] = ((x >> 0) & 1) + ((x >> 3) & 1) + ((x >> 6) & 1) + ((x >> 9) & 1);
etc. (The shift amounts and masks are compile-time constants. Note the bits must be shifted down to weight 1 before adding; masking alone would sum their positional weights.)
Related
For the CUDA kernel function below, which suffers from the branch divergence shown, how can I optimize it?
int gx = threadIdx.x + blockDim.x * blockIdx.x;
val = g_data[gx];
if (gx % 4 == 0)
val = op1(val);
else if (gx % 4 == 1)
val = op2(val);
else if (gx % 4 == 2)
val = op3(val);
else if (gx % 4 == 3)
val = op4(val);
g_data[gx] = val;
If I were programming in CUDA, I certainly wouldn't do any of this. However to answer your question:
how to avoid thread divergence in this CUDA kernel?
You could do something like this:
int gx = threadIdx.x + blockDim.x * blockIdx.x;
float val = g_data[gx];
int gx_bit_0 = gx & 1;
int gx_bit_1 = (gx & 2) >> 1;
val = (1-gx_bit_1)*(1-gx_bit_0)*op1(val) + (1-gx_bit_1)*(gx_bit_0)*op2(val) + (gx_bit_1)*(1-gx_bit_0)*op3(val) + (gx_bit_1)*(gx_bit_0)*op4(val);
g_data[gx] = val;
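The selection arithmetic is easy to sanity-check on the CPU: the four coefficient products form a one-hot set, so exactly one op contributes. Plain C++ stand-ins for op1..op4 here (hypothetical, mirroring the device functions of the full test case):

```cpp
#include <cassert>

// Host-side stand-ins for the device functions.
float op1(float v) { return v + 1.0f; }
float op2(float v) { return v + 2.0f; }
float op3(float v) { return v + 3.0f; }
float op4(float v) { return v + 4.0f; }

// Branch-free selection on the low two bits of gx: exactly one of the four
// products is multiplied by 1, the rest by 0.
float select4(int gx, float val) {
    int b0 = gx & 1;
    int b1 = (gx & 2) >> 1;
    return (1 - b1) * (1 - b0) * op1(val) + (1 - b1) * b0 * op2(val)
         + b1 * (1 - b0) * op3(val)       + b1 * b0 * op4(val);
}
```

The price of removing the branches is that all four ops are evaluated for every element, which only pays off when they are cheap.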
Here is a full test case:
$ cat t1914.cu
#include <iostream>
__device__ float op1(float val) { return val + 1.0f;}
__device__ float op2(float val) { return val + 2.0f;}
__device__ float op3(float val) { return val + 3.0f;}
__device__ float op4(float val) { return val + 4.0f;}
__global__ void k(float *g_data){
int gx = threadIdx.x + blockDim.x * blockIdx.x;
float val = g_data[gx];
int gx_bit_0 = gx & 1;
int gx_bit_1 = (gx & 2) >> 1;
val = (1-gx_bit_1)*(1-gx_bit_0)*op1(val) + (1-gx_bit_1)*(gx_bit_0)*op2(val) + (gx_bit_1)*(1-gx_bit_0)*op3(val) + (gx_bit_1)*(gx_bit_0)*op4(val);
g_data[gx] = val;
}
const int N = 32;
int main(){
float *data;
cudaMallocManaged(&data, N*sizeof(float));
for (int i = 0; i < N; i++) data[i] = 1.0f;
k<<<1,N>>>(data);
cudaDeviceSynchronize();
for (int i = 0; i < N; i++) std::cout << data[i] << std::endl;
}
$ nvcc -o t1914 t1914.cu
$ compute-sanitizer ./t1914
========= COMPUTE-SANITIZER
2
3
4
5
2
3
4
5
2
3
4
5
2
3
4
5
2
3
4
5
2
3
4
5
2
3
4
5
2
3
4
5
========= ERROR SUMMARY: 0 errors
$
Solution by changing the work per thread
The best solution with the existing data layout is to let every thread compute 4 consecutive values. It's better to have fewer threads that can work properly than have more that can't.
float* g_data;
int gx = threadIdx.x + blockDim.x * blockIdx.x;
g_data[4 * gx] = op1(g_data[4 * gx]);
g_data[4 * gx + 1] = op2(g_data[4 * gx + 1]);
g_data[4 * gx + 2] = op3(g_data[4 * gx + 2]);
g_data[4 * gx + 3] = op4(g_data[4 * gx + 3]);
If the size of g_data is not a multiple of 4, put an if around the index operations. If it is always a multiple of 4 and properly aligned, load and store 4 values as a float4 for better performance.
Solution by reordering the work
As all my talk about float4 may have suggested, your input data appears to be some form of 2D structure where every fourth element shares a similar function. Maybe it is an array of structs or an array of vectors -- in other words, a matrix.
For the purpose of explaining what I mean, I consider it a Nx4 matrix. If you transpose this into a 4xN matrix and apply a kernel to it, most of your problems disappear, because entries for which the same operation has to be done are then placed next to each other in memory, and that makes writing an efficient kernel easier. Something like this:
float* g_data;
int rows_in_g;
int gx = threadIdx.x + blockDim.x * blockIdx.x;
int gy = threadIdx.y;
float& own_g = g_data[gx + rows_in_g * gy];
switch(gy) {
case 0: own_g = op1(own_g); break;
case 1: own_g = op2(own_g); break;
case 2: own_g = op3(own_g); break;
case 3: own_g = op4(own_g); break;
default: break;
}
Start this as a 2D kernel with blocksize x=32, y=4 and gridsize x=N/32, y=1.
Now your kernel is still divergent, but all threads within a warp will execute the same case and access consecutive floats in memory. That's the best you can achieve. Of course this all depends on whether you can change the data layout.
I am slightly confused about finding the minimum number of bits for an unsigned magnitude and 2's complement.
This has been my reasoning so far:
For example,
a) 243 decimal
Since 2^8 = 256, unsigned and 2's complement would both need a minimum of 8 bits.
b) -56 decimal
This is impossible for unsigned.
2^6 = 64. One more bit is needed to show it is negative, so minimum 7 bits.
Is my reasoning correct?
The "bits needed" for unsigned is just the position of the most significant bit (+1, depending on the definition of MSB), and for two's complement you can just negate the value and subtract one to make it positive, then add another bit for the sign flag.
int LeadingZeroCount(long value) {
// http://en.wikipedia.org/wiki/Hamming_weight
unsigned long long x = value; // use a 64-bit type even where long is 32 bits
x |= (x >> 1); x |= (x >> 2); x |= (x >> 4);
x |= (x >> 8); x |= (x >> 16); x |= (x >> 32);
x -= (x >> 1) & 0x5555555555555555;
x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333);
x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F;
x += x >> 8; x += x >> 16; x += x >> 32;
return (sizeof(value) << 3) - (x & 0x7F);
}
int MostSignificantBit(long value) {
return (sizeof(value) << 3) - LeadingZeroCount(value);
}
int BitsNeededUnsigned(unsigned long value) {
return MostSignificantBit(value);
}
int BitsNeededTwosComplement(long value) {
if (value < 0)
return BitsNeededUnsigned(-value - 1) + 1;
else
return BitsNeededUnsigned(value);
}
#include <stdio.h>

int main() {
printf("%d\n", BitsNeededUnsigned(243));
printf("%d\n", BitsNeededTwosComplement(243));
printf("%d\n", BitsNeededTwosComplement(-56));
return 0;
}
That's based on your definition of the problem, at least. To me it seems like +243 would need 9 bits for two's complement, since the 0 in the sign bit is still relevant.
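A loop-based cross-check of both conventions (my sketch, independent of the bit-twiddled version above; here nonnegative values are charged a sign bit, per the stricter reading):

```cpp
#include <cassert>

// Minimum bits for an unsigned magnitude: position of the MSB plus one.
int bitsNeededUnsigned(unsigned long long v) {
    int n = 0;
    while (v) { v >>= 1; ++n; }
    return n;
}

// Two's complement: negative v needs bitsNeededUnsigned(-v - 1) + 1;
// nonnegative v additionally needs a 0 sign bit under this convention.
int bitsNeededTwos(long long v) {
    return v < 0 ? bitsNeededUnsigned((unsigned long long)(-(v + 1))) + 1
                 : bitsNeededUnsigned((unsigned long long)v) + 1;
}
```

Writing the negative case as -(v + 1) avoids overflow when v is the most negative representable value.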
What I am asking is if it is possible to join all bits in 2 different numbers.
A pseudo-code example:
bytes=array(0x04, 0x3F);
//place bitwise black magic here
print 0x043F;
Another example:
bytes=array(0xFF, 0xFFFF);
//place bitwise black magic here
print 0xFFFFFF;
Yet another example:
bytes=array(0x34F3, 0x54FD);
//place bitwise black magic here
print 0x34F354FD;
I want to restrict this to only and only bitwise operators (>>, <<, |, ^, ~ and &).
This should work at least in PHP and Javascript.
Is this possible in ANY way?
If I'm not being clear, please ask your doubts in a comment.
If I understand your question correctly,
This should be the answer in php:
$temp = $your_first_value << (strlen(dechex($your_second_value)) * 4); // 4 bits per hex digit
$result = $temp | $your_second_value;
print dechex($result);
Update: instead of + use the | operator
This problem hinges completely on being able to determine the position of the leftmost 1 in an integer. One way to do that is by "smearing the bits right" and then counting the 1's:
Smearing to the right:
int smearright(int x) {
x |= x >> 1;
x |= x >> 2;
x |= x >> 4;
x |= x >> 8;
x |= x >> 16;
return x;
}
Easy, only bitwise operators there. Counting the bits however involves some sort of addition:
int popcnt(int x) {
x = add(x & 0x55555555, (x >> 1) & 0x55555555);
x = add(x & 0x33333333, (x >> 2) & 0x33333333);
x = add(x & 0x0f0f0f0f, (x >> 4) & 0x0f0f0f0f);
x = add(x & 0x00ff00ff, (x >> 8) & 0x00ff00ff);
x = add(x & 0xffff, (x >> 16) & 0xffff);
return x;
}
But that's OK, add can be implemented as
int add(int x, int y) {
int p = x ^ y;
int g = x & y;
g |= p & (g << 1);
p &= p << 1;
g |= p & (g << 2);
p &= p << 2;
g |= p & (g << 4);
p &= p << 4;
g |= p & (g << 8);
p &= p << 8;
g |= p & (g << 16);
return x ^ y ^ (g << 1);
}
Putting it together:
join = (left << popcnt(smearright(right))) | right;
It's obviously much easier if you had addition (no add function), perhaps surprisingly though, it's even simpler than that with multiplication:
join = (left * (smearright(right) + 1)) | right;
No more popcnt at all!
Implementing multiplication in terms of bitwise operators wouldn't help; that's much worse, and I'm not sure you can even do it with the listed operators (unless the right shift is an arithmetic shift, but even then it's a terrible thing involving 32 additions, each of which is a function itself).
There were no "sneaky tricks" in this answer, such as using conditions that implicitly test for equality with zero ("hidden" != 0 in an if, ?:, while etc), and the control flow is actually completely linear (function calls are just there to prevent repeated code, everything can be inlined).
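A quick check of the multiplication variant: smearright(right) + 1 is the power of two just above right's top set bit, so the multiply is exactly the shift that popcnt would have produced. Note this joins at the bit level, packing left immediately above the highest set bit of right.

```cpp
#include <cassert>
#include <cstdint>

// Smear the leftmost 1 bit all the way to the right.
uint32_t smearright(uint32_t x) {
    x |= x >> 1; x |= x >> 2; x |= x >> 4; x |= x >> 8; x |= x >> 16;
    return x;
}

// Join by multiplication: no popcnt needed.
uint32_t join(uint32_t left, uint32_t right) {
    return left * (smearright(right) + 1) | right;
}
```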
Here's an alternative. Instead of taking the popcnt, do a weird variable shift:
int shift_by_mask(int x, int mask) {
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
mask >>= 1;
x <<= mask & 1;
return x;
}
Ok that doesn't make me happy, but here's how you'd use it:
join = shift_by_mask(left, smearright(right)) | right;
Depending on the endianness of your machine, you might have to reverse the order of bytes[0] and bytes[1] below:
uint8_t bytes[2] = { 0x04, 0x3f };
uint16_t result = (bytes[0] << 8) | bytes[1];
(This is in C, shouldn't be hard to translate to PHP etc., the languages and operators are similar enough)
Update:
OK, now that you've clarified what you want, the basic approach is still the same. What you can do instead is count the number of bits in the right number, then do the bit shift as above on the left number, just with a dynamic number of bits. This works as long as you don't have more bits than fit into the largest numeric type your language/platform supports, so in this example 64 bits.
int rightMaxBits = 0;
uint64_t leftNum = 0x04, rightNum = 0x3f;
uint64_t rightNumCopy = rightNum;
while( rightNumCopy )
{
rightNumCopy >>= 1;
rightMaxBits++;
}
uint64_t resultNum = (leftNum << rightMaxBits) | rightNum;
(Thanks to this SO thread for the bit-counting algo.) For signed numbers, I'd suggest you use abs() on the numbers before you call this and then later re-apply the sign in whatever way you want.
Here is my understanding of the execution pattern of CUDA threads. If a particular thread meets a condition, it will execute the kernel statements. Often the indexing and accesses of each thread are done using its thread and block IDs. But when I came across the following piece of code, I stumbled. As for correctness, this code gives a perfectly correct result.
__global__ void kernel0(int *a)
{
int b0 = blockIdx.x;
int t0 = threadIdx.x;
__shared__ int shared_a[32][33];
for (int g5 = 0; g5 <= 96; g5 += 32) {
for (int c0 = 0; c0 <= min(31, -32 * b0 + 99); c0 += 1)
for (int c1 = t0; c1 <= min(32, -g5 + 99); c1 += 32)
shared_a[c0][c1] = a[(32 * b0 + c0) * 100 + (g5 + c1)];
__syncthreads();
if (32 * b0 + t0 <= 99)
for (int c2 = 0; c2 <= min(31, -g5 + 98); c2 += 1)
shared_a[t0][c2 + 1] = (shared_a[t0][c2] + 5);
__syncthreads();
if (((t0 + 31) % 32) + g5 <= 98)
for (int c0 = 0; c0 <= min(31, -32 * b0 + 99); c0 += 1)
a[(32 * b0 + c0) * 100 + (((t0 + 31) % 32) + g5 + 1)] = shared_a[c0][((t0 + 31) % 32) + 1];
__syncthreads();
}
}
My question is: which thread IDs inside a block of size 32 execute the first three for-loops?
Short answer
Every thread will execute the for loops, but only threads whose indices fall within the loop bounds, that is c0 <= min(31, -32 * b0 + 99) and c1 in {t0, t0 + 32, ...} with c1 <= min(32, -g5 + 99), do some work at the inner statement, namely
shared_a[c0][c1] = a[(32 * b0 + c0) * 100 + (g5 + c1)]
About the mapping mechanism
The way you assign each thread its corresponding work is indexing. For example, the following statement will only be executed by thread 0 of each block:
if( threadIdx.x == 0){
// some code
}
While this one will only be executed by the thread with global index 0 in a one-dimensional grid:
if( threadIdx.x + blockIdx.x*blockDim.x == 0){
// some code
}
This code (from a simple array reduction) is also useful to illustrate the behavior:
for( unsigned int s = 1; s < blockDim.x; s *= 2){
int index = 2*s*tid;
if( index < blockDim.x){
sdata[index] += sdata[index + s];
}
__syncthreads();
}
All threads in a block execute the for loop, and each of them has its own value for the index variable. Then the if statement prevents some threads from executing the addition; finally, the addition is only performed by threads whose index variable passes the bound check.
As you can see, this leaves some threads idle while others may have a lot of work to do (load imbalance), so a homogeneous workload across the grid is desirable to maximize performance.
Learning material.
This could be somewhat confusing at first, so I encourage you to read the CUDA C programming guide included in the CUDA toolkit. Play around with the matrix-matrix multiplication, vector addition and vector reduction.
A very comprehensive guide is the "Programming massively parallel processors" book, by David B. Kirk and Wen-mei W. Hwu.
If I have an integer number n, how can I find the next number k > n such that k = 2^i, for some i in N, by bitwise shifting or logic?
Example: If I have n = 123, how can I find k = 128, which is a power of two, and not 124, which is merely divisible by two? This should be simple, but it eludes me.
For 32-bit integers, this is a simple and straightforward route:
unsigned int n;
n--;
n |= n >> 1; // Divide by 2^k for consecutive doublings of k up to 32,
n |= n >> 2; // and then or the results.
n |= n >> 4;
n |= n >> 8;
n |= n >> 16;
n++;             // Before the increment, n is all 1 bits up to the
                 // original MSB; adding 1 yields the next
                 // highest power of 2.
Here's a more concrete example. Let's take the number 221, which is 11011101 in binary:
n--; // 1101 1101 --> 1101 1100
n |= n >> 1; // 1101 1100 | 0110 1110 = 1111 1110
n |= n >> 2; // 1111 1110 | 0011 1111 = 1111 1111
n |= n >> 4; // ...
n |= n >> 8;
n |= n >> 16; // 1111 1111 | 1111 1111 = 1111 1111
n++; // 1111 1111 --> 1 0000 0000
There's one bit in the ninth position, which represents 2^8, or 256, which is indeed the next largest power of 2. Each of the shifts overlaps all of the existing 1 bits in the number with some of the previously untouched zeroes, eventually producing a number of 1 bits equal to the number of significant bits in the original number. Adding one to that value produces a new power of 2.
Another example; we'll use 131, which is 10000011 in binary:
n--; // 1000 0011 --> 1000 0010
n |= n >> 1; // 1000 0010 | 0100 0001 = 1100 0011
n |= n >> 2; // 1100 0011 | 0011 0000 = 1111 0011
n |= n >> 4; // 1111 0011 | 0000 1111 = 1111 1111
n |= n >> 8; // ... (At this point all bits are 1, so further bitwise-or
n |= n >> 16; // operations produce no effect.)
n++; // 1111 1111 --> 1 0000 0000
And indeed, 256 is the next highest power of 2 from 131.
If the number of bits used to represent the integer is itself a power of 2, you can continue to extend this technique efficiently and indefinitely (for example, add a n >> 32 line for 64-bit integers).
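The 64-bit extension described above, with the extra line (a sketch; returns the smallest power of two greater than or equal to n, for n >= 1):

```cpp
#include <cassert>
#include <cstdint>

uint64_t nextPow2_64(uint64_t n) {
    n--;                 // so that exact powers of two map to themselves
    n |= n >> 1;  n |= n >> 2;  n |= n >> 4;
    n |= n >> 8;  n |= n >> 16;
    n |= n >> 32;        // the extra line for 64-bit integers
    return n + 1;
}
```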
There is actually an assembly solution for this (since the 80386 instruction set).
You can use the BSR (Bit Scan Reverse) instruction to scan for the most significant bit in your integer.
bsr scans the bits, starting at the
most significant bit, in the
doubleword operand or the second word.
If the bits are all zero, ZF is
cleared. Otherwise, ZF is set and the
bit index of the first set bit found,
while scanning in the reverse
direction, is loaded into the
destination register
(Extracted from: http://dlc.sun.com/pdf/802-1948/802-1948.pdf)
Then raise two to (that bit index + 1):
so:
bsr ecx, eax //eax = number
jz #zero
mov eax, 2 // start from bit 1, i.e. 2 (instead of incrementing ecx)
shl eax, ecx // and move it ecx times to the left
ret // result is in eax
#zero:
xor eax, eax
ret
In newer CPUs you can use the much faster lzcnt instruction (a.k.a. rep bsr), which does its job in a single cycle.
A more mathematical way, without loops:
public static int ByLogs(int n)
{
double y = Math.Floor(Math.Log(n, 2));
return (int)Math.Pow(2, y + 1);
}
Here's a logic answer:
function getK(int n)
{
int k = 1;
while (k < n)
k *= 2;
return k;
}
Here's John Feminella's answer implemented as a loop so it can handle Python's long integers:
def next_power_of_2(n):
"""
Return next power of 2 greater than or equal to n
"""
n -= 1 # greater than OR EQUAL TO n
shift = 1
while (n+1) & n: # n+1 is not a power of 2 yet
n |= n >> shift
shift <<= 1
return n + 1
It also returns faster if n is already a power of 2.
For Python 2.7 and later, this is simpler and faster for most N:
def next_power_of_2(n):
"""
Return next power of 2 greater than or equal to n
"""
return 2**(n-1).bit_length()
This answer is based on constexpr to prevent any computing at runtime when the function parameter is passed as const
Greater than / Greater than or equal to
The following snippets are for the next number k > n such that k = 2^i
(n=123 => k=128, n=128 => k=256) as specified by OP.
If you want the smallest power of 2 greater than OR equal to n then just replace __builtin_clzll(n) by __builtin_clzll(n-1) in the following snippets.
C++11 using GCC or Clang (64 bits)
#include <cstdint> // uint64_t
constexpr uint64_t nextPowerOfTwo64 (uint64_t n)
{
return 1ULL << (sizeof(uint64_t) * 8 - __builtin_clzll(n));
}
Enhancement using CHAR_BIT as proposed by martinec
#include <climits> // CHAR_BIT
#include <cstdint> // uint64_t
constexpr uint64_t nextPowerOfTwo64 (uint64_t n)
{
return 1ULL << (sizeof(uint64_t) * CHAR_BIT - __builtin_clzll(n));
}
C++17 using GCC or Clang (from 8 to 128 bits)
#include <climits>
#include <cstdint>
template <typename T>
constexpr T nextPowerOfTwo64 (T n)
{
T clz = 0;
if constexpr (sizeof(T) * CHAR_BIT <= 32)
clz = __builtin_clzl(n); // unsigned long
else if constexpr (sizeof(T) * CHAR_BIT <= 64)
clz = __builtin_clzll(n); // unsigned long long
else { // See https://stackoverflow.com/a/40528716
uint64_t hi = n >> 64;
uint64_t lo = (hi == 0) ? n : -1ULL;
clz = _lzcnt_u64(hi) + _lzcnt_u64(lo);
}
return T{1} << (CHAR_BIT * sizeof(T) - clz);
}
Other compilers
If you use a compiler other than GCC or Clang, please visit the Wikipedia page listing the Count Leading Zeroes bitwise functions:
Visual C++ 2005 => Replace __builtin_clzl() by _BitScanReverse()
Visual C++ 2008 => Replace __builtin_clzl() by __lzcnt()
icc => Replace __builtin_clzl() by _bit_scan_reverse
GHC (Haskell) => Replace __builtin_clzl() by countLeadingZeros()
Contribution welcome
Please propose improvements within the comments. Also propose alternative for the compiler you use, or your programming language...
See also similar answers
nulleight's answer
ydroneaud's answer
Here's a wild one that has no loops, but uses an intermediate float.
// compute k = nextpowerof2(n)
if (n > 1)
{
float f = (float) n;
unsigned int const t = 1U << ((*(unsigned int *)&f >> 23) - 0x7f);
k = t << (t < n);
}
else k = 1;
This, and many other bit-twiddling hacks, including the one submitted by John Feminella, can be found here.
Assume x is not negative.
int pot = Integer.highestOneBit(x);
if (pot != x) {
pot *= 2;
}
If you use GCC, MinGW or Clang:
template <typename T>
T nextPow2(T in)
{
return (in & (T)(in - 1)) ? (1U << (sizeof(T) * 8 - __builtin_clz(in))) : in;
}
If you use Microsoft Visual C++, use the function _BitScanReverse() to replace __builtin_clz().
function Pow2Thing(int n)
{
x = 1;
while (n>0)
{
n/=2;
x*=2;
}
return x;
}
Bit-twiddling, you say?
long int pow_2_ceil(long int t) {
if (t == 0) return 1;
if (t != (t & -t)) {
do {
t -= t & -t;
} while (t != (t & -t));
t <<= 1;
}
return t;
}
Each loop strips the least-significant 1-bit directly. N.B. This only works where signed numbers are encoded in two's complement.
What about something like this:
int pot = 1;
for (int i = 0; i < 31; i++, pot <<= 1)
if (pot >= x)
break;
You just need to find the most significant bit and shift it left once. Here's a Python implementation. I think x86 has an instruction to get the MSB, but here I'm implementing it all in straight Python. Once you have the MSB it's easy.
>>> def msb(n):
... result = -1
... index = 0
... while n:
... bit = 1 << index
... if bit & n:
... result = index
... n &= ~bit
... index += 1
... return result
...
>>> def next_pow(n):
... return 1 << (msb(n) + 1)
...
>>> next_pow(1)
2
>>> next_pow(2)
4
>>> next_pow(3)
4
>>> next_pow(4)
8
>>> next_pow(123)
128
>>> next_pow(222)
256
>>>
Forget this one! It uses a loop!
unsigned int nextPowerOf2 ( unsigned int u)
{
unsigned int v = 0x80000000; // supposed 32-bit unsigned int
if (u < v) {
while (v > u) v = v >> 1;
}
return (v << 1); // return 0 if number is too big
}
private static int nextHighestPower(int number){
if((number & number-1)==0){
return number;
}
else{
int count=0;
while(number!=0){
number=number>>1;
count++;
}
return 1<<count;
}
}
// n is the number; note this only works when the set bits of n are
// contiguous (e.g. 0b0110 -> 0b1000); it fails for e.g. 0b0101
int min = (n & -n);
int nextPowerOfTwo = n + min;
#define nextPowerOf2(x, n) (((x) + ((n) - 1)) & ~((n) - 1))
or even
#define nextPowerOf2(x, n) ((x) + (-(x) & ((n) - 1)))
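Note that these macros round x up to a multiple of a given power of two n; they don't derive the next power of two from x alone. A function form of the first one (my sketch) makes that easy to check:

```cpp
#include <cassert>

// Rounds x up to the next multiple of n; n must be a power of two.
unsigned roundUpToMultiple(unsigned x, unsigned n) {
    return (x + (n - 1)) & ~(n - 1);
}
```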