Use of half2 in CUDA

I am trying to use half2, but I run into an error, namely,
error: class "__half2" has no member "y"
The section of code where the error occurs is as follows:
uint8_t V_ [128]; // some elements (uint8), to save space
float V_C[128]; // storing the diff to use later
half2 *C_ = C.elements; // D halfs stored as half2, to be read
Cvalue = 0.0;
for (d = 0; d < D; d += 2)
{
    V_C[d]   = V_[d]   - __half2float(C_[d/2].x);
    V_C[d+1] = V_[d+1] - __half2float(C_[d/2].y);
    Cvalue += V_C[d]   * V_C[d];
    Cvalue += V_C[d+1] * V_C[d+1];
}
Any help please?
Update:
Thank you for your help! I finally used the following...
uint8_t V_ [128] ;
float V_C[128] ;
const half2 *C_ = C.elements;
Cvalue = 0.0;
float2 temp_;
for (d = 0; d < D; d += 2)
{
    temp_ = __half22float2(C_[d/2]);
    V_C[d]   = V_[d]   - temp_.x;
    V_C[d+1] = V_[d+1] - temp_.y;
    Cvalue += V_C[d]   * V_C[d];
    Cvalue += V_C[d+1] * V_C[d+1];
}
I got a slight speedup in my particular application, as loads from global memory were the bottleneck...

You cannot access the parts of a half2 with the dot operator; you should use the intrinsic functions for that.
From the documentation:
__CUDA_FP16_DECL__ float __high2float ( const __half2 a )
Converts high 16 bits of half2 to float and returns the result.
__CUDA_FP16_DECL__ __half __high2half ( const __half2 a )
Returns high 16 bits of half2 input.
__CUDA_FP16_DECL__ __half2 __high2half2 ( const __half2 a )
Extracts high 16 bits from half2 input.
__CUDA_FP16_DECL__ __half2 __highs2half2 ( const __half2 a, const __half2 b )
Extracts high 16 bits from each of the two half2 inputs and combines into one half2 number.
__CUDA_FP16_DECL__ float __low2float ( const __half2 a )
Converts low 16 bits of half2 to float and returns the result.
__CUDA_FP16_DECL__ __half __low2half ( const __half2 a )
Returns low 16 bits of half2 input.
__CUDA_FP16_DECL__ __half2 __low2half2 ( const __half2 a )
Extracts low 16 bits from half2 input.
__CUDA_FP16_DECL__ __half2 __lowhigh2highlow ( const __half2 a )
Swaps both halves of the half2 input.
__CUDA_FP16_DECL__ __half2 __lows2half2 ( const __half2 a, const __half2 b )
Extracts low 16 bits from each of the two half2 inputs and combines into one half2 number.
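For example, the loop from the question can be written with these intrinsics (a minimal sketch reusing the question's variables):
for (d = 0; d < D; d += 2)
{
    half2 h = C_[d / 2];
    V_C[d]   = V_[d]   - __low2float(h);  // low 16 bits = first element
    V_C[d+1] = V_[d+1] - __high2float(h); // high 16 bits = second element
    Cvalue += V_C[d]   * V_C[d];
    Cvalue += V_C[d+1] * V_C[d+1];
}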
More than that, depending on what type C.elements is, this line
half2 *C_ = C.elements; // D halfs stored as half2, to be read
might be wrong (if C.elements is a half*; the comment is unclear on this).
half2 is not a pair of halfs.
Indeed, in the current implementation half2 is just an unsigned int wrapped in a struct:
// cuda_fp16.h
typedef struct __align__(2) {
    unsigned short x;
} __half;
typedef struct __align__(4) {
    unsigned int x;
} __half2;
#ifndef CUDA_NO_HALF
typedef __half half;
typedef __half2 half2;
#endif /*CUDA_NO_HALF*/
No one said that an array of halfs can be accessed as an array of half2s.
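If C.elements really is a plain half*, a safer pattern is to pack each pair explicitly instead of reinterpreting the pointer (a sketch; it assumes C.elements is a half* and uses the documented __halves2half2 pairing intrinsic; note the __align__(4) above also means a reinterpreted half* would need 4-byte alignment):
const half *p = C.elements;                    // assumed: a plain array of halfs
for (d = 0; d < D; d += 2)
{
    half2  h = __halves2half2(p[d], p[d + 1]); // pack two neighboring halfs
    float2 f = __half22float2(h);              // convert both at once
    V_C[d]   = V_[d]   - f.x;
    V_C[d+1] = V_[d+1] - f.y;
}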

How to do dot product between matrices in caffe?

In an inner product layer, I need to multiply (top_diff * bottom_data) .* (2 * weight). First we calculate result = top_diff * bottom_data as a matrix multiplication in caffe_cpu_gemm, and then do a dot product between weight and result.
In more detail:
const Dtype* weight = this->blobs_[0]->cpu_data();
if (this->param_propagate_down_[0]) {
  const Dtype* top_diff = top[0]->cpu_diff();
  const Dtype* bottom_data = bottom[0]->cpu_data();
  caffe_cpu_gemm<Dtype>(CblasTrans, CblasNoTrans, N_, K_, M_, (Dtype)1.,
      top_diff, bottom_data, (Dtype)1., this->blobs_[0]->mutable_cpu_diff());
}
For a better understanding, I checked math_functions.cpp, where it is implemented as follows:
template<>
void caffe_cpu_gemm<float>(const CBLAS_TRANSPOSE TransA,
    const CBLAS_TRANSPOSE TransB, const int M, const int N, const int K,
    const float alpha, const float* A, const float* B, const float beta,
    float* C) {
  int lda = (TransA == CblasNoTrans) ? K : M;
  int ldb = (TransB == CblasNoTrans) ? N : K;
  cblas_sgemm(CblasRowMajor, TransA, TransB, M, N, K, alpha, A, lda, B,
      ldb, beta, C, N);
}
I think I should perform the multiplication (result = top_diff * bottom_data) in caffe_cpu_gemm() and after that do a dot product with weight. How should I do this?!
Many thanks!!!! Any advice would be appreciated!
If you just want to perform dot product between two matrices, you can use the following function to multiply matrices on CPU,
void caffe_mul<float>(const int n, const float* a, const float* b, float* y)
If you want to do the same operation on a GPU, use this template
void caffe_gpu_mul<float>(const int N, const float* a, const float* b, float* y)
a and b are your matrices, and y will contain the final result. N is the total number of elements in your matrices.
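Putting the pieces together, a sketch of the whole backward computation might look like this (the scratch buffers result and scaled are hypothetical names of mine; the caffe_cpu_gemm and caffe_mul signatures are the ones quoted above):
// Hypothetical scratch buffers, sized like the weight blob (N_ x K_).
Dtype* result = new Dtype[N_ * K_];
Dtype* scaled = new Dtype[N_ * K_];
// Step 1: result = top_diff^T * bottom_data, a matrix multiplication.
caffe_cpu_gemm<Dtype>(CblasTrans, CblasNoTrans, N_, K_, M_, (Dtype)1.,
    top_diff, bottom_data, (Dtype)0., result);
// Step 2: scaled = result .* weight, element by element...
caffe_mul<Dtype>(N_ * K_, result, weight, scaled);
// ...then fold in the factor of 2 from (2 * weight).
for (int i = 0; i < N_ * K_; ++i) scaled[i] *= Dtype(2);
// ...use scaled here, then clean up:
delete[] result;
delete[] scaled;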
You can also use the 'Eltwise' layer, which already does this.

How to evaluate memory time and compute time for CUDA kernel?

I was working on an algorithm in CUDA and wanted to understand the performance of my kernel so I could optimize it appropriately.
I am required to determine whether my kernel is compute bound or memory bound, using source code modifications only. NVIDIA docs suggest I run the kernel without memory accesses to determine compute time, and similarly run the kernel without any computations to determine memory time.
I do not know how to appropriately modify my source code to achieve this. How can you perform computations without memory access (or how can you compute a result without accessing the variables stored in memory)? Could you suggest an example for the memory and the computation case in the following code, so I can work on modifying it completely myself...
__device__ inline float cndGPU(float d)
{
    const float A1 = 0.31938153f;
    const float A2 = -0.356563782f;
    const float A3 = 1.781477937f;
    const float A4 = -1.821255978f;
    const float A5 = 1.330274429f;
    const float RSQRT2PI = 0.39894228040143267793994605993438f;
    float K = 1.0f / (1.0f + 0.2316419f * fabsf(d));
    float cnd = RSQRT2PI * __expf(-0.5f * d * d) *
        (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5)))));
    if (d > 0)
        cnd = 1.0f - cnd;
    return cnd;
}
__device__ inline void BlackScholesBodyGPU(
    float &CallResult,
    float &PutResult,
    float S, //Stock price
    float X, //Option strike
    float T, //Option years
    float R, //Riskless rate
    float V  //Volatility rate
)
{
    float sqrtT, expRT;
    float d1, d2, CNDD1, CNDD2;
    sqrtT = sqrtf(T);
    d1 = (__logf(S / X) + (R + 0.5f * V * V) * T) / (V * sqrtT);
    d2 = d1 - V * sqrtT;
    CNDD1 = cndGPU(d1);
    CNDD2 = cndGPU(d2);
    //Calculate Call and Put simultaneously
    expRT = __expf(-R * T);
    CallResult = S * CNDD1 - X * expRT * CNDD2;
    PutResult = X * expRT * (1.0f - CNDD2) - S * (1.0f - CNDD1);
}
How I see it.
If you have:
float cndGPU(int d) {   // note: the index must be an int, not a float
    const float a = 1;
    const float b = 2;
    float c;
    c = a + b + arr[d]; // arr is some array in global memory
    return c;
}
Checking compute time without memory access: literally write all your computing expressions as one, without using variables:
return 1 + 2 + 3; //just put some number that can be in arr[d]
Checking the memory access - literally the opposite:
const float a = 1;
const float b = 2;
float c;
c = arr[d]; //here we have our memory access
return c;
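Applied to the Black-Scholes code from the question, the two modified kernels could look roughly like this (a sketch; the wrapper kernels and the doStore flag are my own additions: the flag is always passed as 0 at runtime, but the compiler cannot prove that, so it cannot optimize the arithmetic away):
__global__ void BlackScholesComputeOnly(float *CallResult, float *PutResult,
                                        int OptN, int doStore)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= OptN) return;
    // Inputs derived from the thread index instead of global memory loads.
    float S = 30.0f + (float)(i & 31);
    float X = 35.0f, T = 0.5f, R = 0.02f, V = 0.3f;
    float call, put;
    BlackScholesBodyGPU(call, put, S, X, T, R, V);
    if (doStore) {            // pass doStore = 0: the store never executes,
        CallResult[i] = call; // but the computation cannot be eliminated
        PutResult[i]  = put;
    }
}
__global__ void BlackScholesMemoryOnly(float *CallResult, float *PutResult,
                                       const float *S, const float *X,
                                       const float *T, const float *R,
                                       const float *V, int OptN)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= OptN) return;
    // Same loads and stores as the real kernel, but (almost) no arithmetic.
    CallResult[i] = S[i] + T[i] + V[i];
    PutResult[i]  = X[i] + R[i];
}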

Learning CUDA, but currently stuck

So I've been trying to learn CUDA as of late, but am currently stuck and don't know what I'm doing wrong. I am trying to set the initial value of the opool array based on a random float between 0 and 1. If anyone could shed some light on what I did wrong it would be greatly appreciated.
Note - I omitted some code for brevity (cudaFree() & free() calls mainly). I apologize if I left any code of importance out.
__global__ void FirstLoop( int *opool, float *randomSet, int omax, int anumber )
{
    int tid_loci = threadIdx.x;
    int tid_2    = threadIdx.y;
    int bid_omax = blockIdx.x;
    int index = omax*tid_loci*2 + omax*tid_2 + bid_omax;
    float r = randomSet[ index ];
    // Commented out code is what it should be set to, but they are set to 5 or 15
    // to determine if the values are correctly being set.
    if ( r < 0.99 )
        opool[ index ] = 15; //(int)((r * 100.0) * -1.0);
    else
        opool[ index ] = 5;  //(int)((r)*(float)(anumber-4)) +5;
}
int main()
{
    int loci = 10;
    int omax = 20;
    // Data stored on the host
    int   *h_opool;
    float *h_randomSet;
    // Data stored on the device
    int   *d_opool;
    float *d_randomSet;
    int poolSize   = helpSize * omax;
    int randomSize = loci * 2 * omax * sizeof(float);
    // RESIZE ARRAYS TO NEEDED SIZE
    h_opool     = (int*)malloc( poolSize );
    h_randomSet = (float*)malloc( randomSize );
    cudaMalloc( &d_opool, poolSize );
    cudaMalloc( &d_randomSet, randomSize );
    for (sim=0; sim<smax; sim++)
    {
        for (i=0; i<poolSize; i++)
            h_randomSet[i] = rndm();
        dim3 blocks(omax);
        dim3 thread(loci, 2);
        cudaMemcpy( d_randomSet, h_randomSet, randomSize, cudaMemcpyHostToDevice );
        cudaMemcpy( d_opool, h_opool, poolSize, cudaMemcpyHostToDevice );
        FirstLoop<<< blocks, thread >>>(d_opool, d_randomSet, omax, anumber );
        cudaMemcpy( h_opool, d_opool, poolSize, cudaMemcpyDeviceToHost );
        // Here is when I call printf to see the values stored in h_opool, but they are
        // completely wrong
    }
}
float rndm()
{
    int random = rand();
    return ((float)random / (float)RAND_MAX);
}
Change the following
int index = omax*tid_loci*2 + omax*tid_2 + bid_omax;
to
int index = bid_omax * tid_2 + tid_loci;
However, a block configuration of 10x2 may not be ideal. Try using 32x1 or 16x2: block sizes that are multiples of the 32-thread warp map better onto the hardware. A sketch of a common indexing pattern follows.
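For reference, the canonical way to build a unique, coalesced index from a 2D block is to flatten the thread index first and then offset by the block (a sketch of my own, independent of the poster's particular data layout):
__global__ void IndexingSketch(int *out, int n)
{
    // Consecutive threadIdx.x values map to consecutive addresses,
    // which is what coalesced global memory access wants.
    int tid   = threadIdx.y * blockDim.x + threadIdx.x;
    int index = blockIdx.x * (blockDim.x * blockDim.y) + tid;
    if (index < n) out[index] = index;
}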

1D problems in CUDA and HPC

I'm looking for some 1D problems in CUDA and HPC, e.g. Black Scholes.
By 1D problems, I mean problems in which all the work is done on 1D arrays. Although matrix multiplication can be expressed in this way, I want problems in which the basic problem is just 1D.
I am trying to develop a 1D library for CUDA and need some benchmark problems to test it. I realize that a lot of real-world problems are expressed as 2D; I would really like to see some real-world 1D problems.
Thanks.
EDIT: Thanks for all the answers. It'll be great if the answers contain more HPC problems, e.g. Black Scholes, rather than just generic algorithms.
Thanks.
A common problem in parallel programming is a reduction, or its generalization, the "prefix sum" (scan): you are given an array of numbers, and every element must end up storing the sum of all preceding elements (+ itself or not; I prefer inclusive).
It is a fairly simple problem, but since it is repeated many times inside more complex algorithms, having an efficient implementation is crucial.
Another common problem is sorting.
There are already some papers on that topic that you can take as a starting point.
I think it is a good problem to start with, to solve bigger problems on top of it.
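For what it's worth, Thrust ships an inclusive scan, which makes a handy correctness baseline while you develop your own version (a minimal sketch):
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <cstdio>

int main()
{
    thrust::device_vector<int> d(8, 1);                    // 1 1 1 1 1 1 1 1
    thrust::inclusive_scan(d.begin(), d.end(), d.begin()); // 1 2 3 4 5 6 7 8
    for (int i = 0; i < 8; i++) printf("%d ", (int)d[i]);
    printf("\n");
    return 0;
}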
A simple problem you can use for 1 to 3 dimensions is the heat equation. There are several different numerical methods for solving it, and some of them can be implemented in parallel.
A method that works at least with OpenMP and MPI is the finite difference method. I suppose that if you combine it with a clever stencil you should be able to implement it efficiently in CUDA C.
A classical 1D example is provided by the heat equation.
Below, I'm posting a concrete, fully worked CPU/GPU example on this topic exploiting the Jacobi solution scheme. Please note that two time-step kernels are provided, one not using shared memory and one using it.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>                  // fabsf
#include <thrust/device_vector.h>  // note: forward slashes, not backslashes
#include <thrust/reduce.h>         // thrust::reduce
#include "Utilities.cuh"
#define BLOCKSIZE 512
/****************************/
/* CPU CALCULATION FUNCTION */
/****************************/
void HeatEquation1DCPU(float * __restrict__ h_T, int *Niter, const float T0, const float Q_N_1, const float dx, const float k, const float rho,
                       const float cp, const float alpha, const float dt, const float maxErr, const int maxIterNumber, const int N)
{
    float *h_DeltaT = (float *)malloc(N * sizeof(float));
    // --- Enforcing boundary condition at the left end.
    *h_T = T0;
    h_DeltaT[0] = 0.f;
    float current_max;
    do {
        // --- Internal region between the two boundaries.
        for (int i = 1; i < N - 1; i++) h_DeltaT[i] = dt * alpha * ((h_T[i - 1] + h_T[i + 1] - 2.f * h_T[i]) / (dx * dx));
        // --- Enforcing boundary condition at the right end.
        h_DeltaT[N - 1] = dt * 2.f * ((k * ((h_T[N - 2] - h_T[N - 1]) / dx) + Q_N_1) / (dx * rho * cp));
        // --- Update the temperature and find the maximum DeltaT over all nodes
        current_max = h_DeltaT[0]; // --- Remember: h_DeltaT[0] = 0
        for (int i = 1; i < N; i++)
        {
            h_T[i] = h_T[i] + h_DeltaT[i]; // --- h_T[0] keeps its value, since h_DeltaT[0] = 0
            current_max = fabsf(h_DeltaT[i]) > current_max ? fabsf(h_DeltaT[i]) : current_max; // --- fabsf, not abs: abs would truncate to int
        }
        // --- Increase iteration counter
        (*Niter)++;
    } while (*Niter < maxIterNumber && current_max > maxErr);
    free(h_DeltaT); // --- Allocated with malloc, so free it (not delete [])
}
/**************************/
/* GPU CALCULATION KERNEL */
/**************************/
__global__ void HeatEquation1DGPU_IterationKernel(float * __restrict__ d_T, float * __restrict__ d_DeltaT, const float T0, const float Q_N_1, const float dx, const float k, const float rho,
const float cp, const float alpha, const float dt, const float maxErr, const int maxIterNumber, const int N)
{
const int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid < N) {
// --- Internal region between the two boundaries.
if ((tid > 0) && (tid < N - 1) ) d_DeltaT[tid] = dt * alpha *((d_T[tid - 1] + d_T[tid + 1] - 2.f * d_T[tid]) / (dx * dx));
// --- Enforcing boundary condition at the left end.
if (tid == 0) d_DeltaT[0] = 0.f;
// --- Enforcing boundary condition at the right end.
if (tid == N - 1) d_DeltaT[tid] = dt * 2.f * ((k * ((d_T[tid - 1] - d_T[tid]) / dx) + Q_N_1) / (dx * rho * cp));
// --- Update the temperature
d_T[tid] = d_T[tid] + d_DeltaT[tid];
d_DeltaT[tid] = fabsf(d_DeltaT[tid]);
}
}
__global__ void HeatEquation1DGPU_IterationSharedKernel(float * __restrict__ d_T, float * __restrict__ d_DeltaT, const float T0, const float Q_N_1, const float dx, const float k, const float rho,
const float cp, const float alpha, const float dt, const float maxErr, const int maxIterNumber, const int N)
{
const int tid = blockIdx.x * blockDim.x + threadIdx.x;
// --- Shared memory has 0, 1, ..., BLOCKSIZE - 1, BLOCKSIZE locations, so it has BLOCKSIZE locations + 2 (left and right) halo cells.
__shared__ float d_T_shared[BLOCKSIZE + 2]; // --- Need to know BLOCKSIZE beforehand
if (tid < N) {
// --- Load data from global memory to shared memory locations 1, 2, ..., BLOCKSIZE - 1
d_T_shared[threadIdx.x + 1] = d_T[tid];
// --- Left halo cell
if ((threadIdx.x == 0) && (tid > 0)) { d_T_shared[0] = d_T[tid - 1]; }
// --- Right halo cell
if ((threadIdx.x == blockDim.x - 1) && (tid < N - 1)) { d_T_shared[threadIdx.x + 2] = d_T[tid + 1]; }
__syncthreads();
// --- Internal region between the two boundaries.
if ((tid > 0) && (tid < N - 1) ) d_DeltaT[tid] = dt * alpha *((d_T_shared[threadIdx.x] + d_T_shared[threadIdx.x + 2] - 2.f * d_T_shared[threadIdx.x + 1]) / (dx * dx));
// --- Enforcing boundary condition at the left end.
if (tid == 0) d_DeltaT[0] = 0.f;
// --- Enforcing boundary condition at the right end.
if (tid == N - 1) d_DeltaT[tid] = dt * 2.f * ((k * ((d_T_shared[threadIdx.x] - d_T_shared[threadIdx.x + 1]) / dx) + Q_N_1) / (dx * rho * cp));
// --- Update the temperature
d_T[tid] = d_T[tid] + d_DeltaT[tid];
d_DeltaT[tid] = fabsf(d_DeltaT[tid]);
}
}
/****************************/
/* GPU CALCULATION FUNCTION */
/****************************/
void HeatEquation1DGPU(float * __restrict__ d_T, int *Niter, const float T0, const float Q_N_1, const float dx, const float k, const float rho,
const float cp, const float alpha, const float dt, const float maxErr, const int maxIterNumber, const int N)
{
// --- Absolute values of DeltaT
float *d_DeltaT; gpuErrchk(cudaMalloc(&d_DeltaT, N * sizeof(float)));
// --- Enforcing boundary condition at the left end.
gpuErrchk(cudaMemcpy(d_T, &T0, sizeof(float), cudaMemcpyHostToDevice));
float current_max = 0.f;
do {
//HeatEquation1DGPU_IterationKernel<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_T, d_DeltaT, T0, Q_N_1, dx, k, rho, cp, alpha, dt, maxErr, maxIterNumber, N);
HeatEquation1DGPU_IterationSharedKernel<<<iDivUp(N, BLOCKSIZE), BLOCKSIZE>>>(d_T, d_DeltaT, T0, Q_N_1, dx, k, rho, cp, alpha, dt, maxErr, maxIterNumber, N);
thrust::device_ptr<float> d = thrust::device_pointer_cast(d_DeltaT);
current_max = thrust::reduce(d, d + N, 0.f, thrust::maximum<float>()); // --- Reset the initial value each iteration, otherwise the maximum can never decrease
// --- Increase iteration counter
(*Niter)++;
} while (*Niter < maxIterNumber && current_max > maxErr);
gpuErrchk(cudaFree(d_DeltaT));
}
/********/
/* MAIN */
/********/
int main()
{
// --- See https://en.wikipedia.org/wiki/Thermal_diffusivity
// --- Parameters of the problem
const float k = 0.19f; // --- Thermal conductivity [W / (m * K)]
const float rho = 930.f; // --- Density [kg / m^3]
const float cp = 1340.f; // --- Specific heat capacity [J / (kg * K)]
const float alpha = k / (rho * cp); // --- Thermal diffusivity [m^2 / s]
const float length = 1.6f; // --- Total length of the domain [m]
const int N = 64 * BLOCKSIZE; // --- Number of grid points
const float dx = (length / (float)(N - 1));// --- Discretization step [m]
const float dt = (float)(dx * dx / (4.f * alpha));
// --- Time step [s]
const float T0 = 0.f; // --- Temperature at the first end of the domain [C]
const float Q_N_1 = 10.f; // --- Heat flux at the second end of the domain [W / m^2]
const float maxErr = 1.0e-5f; // --- Maximum admitted DeltaT
const int maxIterNumber = 10.0 / dt; // --- Number of overall time steps
/********************/
/* GPU CALCULATIONS */
/********************/
float *h_T_final_device = (float *)malloc(N * sizeof(float)); // --- Final "host-side" result of GPU calculations
int Niter_GPU = 0; // --- Iteration counter for GPU calculations
// --- Device temperature allocation and initialization
float *d_T; gpuErrchk(cudaMalloc(&d_T, N * sizeof(float)));
gpuErrchk(cudaMemset(d_T, 0, N * sizeof(float)));
// --- GPU calculations
HeatEquation1DGPU(d_T, &Niter_GPU, T0, Q_N_1, dx, k, rho, cp, alpha, dt, maxErr, maxIterNumber, N);
// --- Transfer the GPU calculation results from device to host
gpuErrchk(cudaMemcpy(h_T_final_device, d_T, N * sizeof(float), cudaMemcpyDeviceToHost));
/********************/
/* CPU CALCULATIONS */
/********************/
// --- Host temperature allocation and initialization
float *h_T_final_host = (float *)malloc(N * sizeof(float));
memset(h_T_final_host, 0, N * sizeof(float));
int Niter_CPU = 0;
HeatEquation1DCPU(h_T_final_host, &Niter_CPU, T0, Q_N_1, dx, k, rho, cp, alpha, dt, maxErr, maxIterNumber, N);
/************************/
/* CHECKING THE RESULTS */
/************************/
for (int i = 0; i < N; i++) {
printf("Node = %i; T_host = %3.10f; T_device = %3.10f\n", i, h_T_final_host[i], h_T_final_device[i]);
if (h_T_final_host[i] != h_T_final_device[i]) {
printf("Error at i = %i; T_host = %f; T_device = %f\n", i, h_T_final_host[i], h_T_final_device[i]);
return 0;
}
}
printf("Test passed!\n");
free(h_T_final_device); // --- Allocated with malloc, so free (not delete [])
free(h_T_final_host);
gpuErrchk(cudaFree(d_T));
return 0;
}
Reduction (finding the min, max, or sum of an array) and sorting are the best examples of 1D problems. There are many variants of these algorithms, like sorting on structures, etc.

Given an integer, how do I find the next largest power of two using bit-twiddling?

If I have an integer number n, how can I find the next number k > n such that k = 2^i, for some natural number i, using bitwise shifts or logic?
Example: if I have n = 123, how can I find k = 128, which is a power of two, and not 124, which is merely divisible by two? This should be simple, but it eludes me.
For 32-bit integers, this is a simple and straightforward route:
unsigned int n;
n--;
n |= n >> 1; // Divide by 2^k for consecutive doublings of k up to 32,
n |= n >> 2; // and then or the results.
n |= n >> 4;
n |= n >> 8;
n |= n >> 16;
n++; // The result is a number of 1 bits equal to the number
// of bits in the original number, plus 1. That's the
// next highest power of 2.
Here's a more concrete example. Let's take the number 221, which is 11011101 in binary:
n--; // 1101 1101 --> 1101 1100
n |= n >> 1; // 1101 1100 | 0110 1110 = 1111 1110
n |= n >> 2; // 1111 1110 | 0011 1111 = 1111 1111
n |= n >> 4; // ...
n |= n >> 8;
n |= n >> 16; // 1111 1111 | 1111 1111 = 1111 1111
n++; // 1111 1111 --> 1 0000 0000
There's one bit in the ninth position, which represents 2^8, or 256, which is indeed the next largest power of 2. Each of the shifts overlaps all of the existing 1 bits in the number with some of the previously untouched zeroes, eventually producing a number of 1 bits equal to the number of bits in the original number. Adding one to that value produces a new power of 2.
Another example; we'll use 131, which is 10000011 in binary:
n--; // 1000 0011 --> 1000 0010
n |= n >> 1; // 1000 0010 | 0100 0001 = 1100 0011
n |= n >> 2; // 1100 0011 | 0011 0000 = 1111 0011
n |= n >> 4; // 1111 0011 | 0000 1111 = 1111 1111
n |= n >> 8; // ... (At this point all bits are 1, so further bitwise-or
n |= n >> 16; // operations produce no effect.)
n++; // 1111 1111 --> 1 0000 0000
And indeed, 256 is the next highest power of 2 from 131.
If the number of bits used to represent the integer is itself a power of 2, you can continue to extend this technique efficiently and indefinitely (for example, add a n >> 32 line for 64-bit integers).
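For instance, a 64-bit variant of the same trick might look like this (a sketch with fixed-width types; like the code above, it maps an exact power of two to itself because of the initial decrement):
#include <cstdint>

uint64_t next_pow2_64(uint64_t n)
{
    n--;           // so an exact power of two maps to itself
    n |= n >> 1;
    n |= n >> 2;
    n |= n >> 4;
    n |= n >> 8;
    n |= n >> 16;
    n |= n >> 32;  // the extra line needed for 64-bit integers
    return n + 1;
}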
There is actually an assembly solution for this (since the 80386 instruction set).
You can use the BSR (Bit Scan Reverse) instruction to scan for the most significant bit in your integer.
bsr scans the bits, starting at the most significant bit, in the doubleword operand or the second word. If the bits are all zero, ZF is cleared. Otherwise, ZF is set and the bit index of the first set bit found, while scanning in the reverse direction, is loaded into the destination register.
(Extracted from: http://dlc.sun.com/pdf/802-1948/802-1948.pdf)
Then increment the result by one; the code below does that implicitly by shifting a 2 instead of a 1:
bsr ecx, eax   // eax = number
jz #zero
mov eax, 2     // set the second bit (instead of an inc ecx)
shl eax, ecx   // and shift it ecx times to the left
ret            // the result is in eax
#zero:
xor eax, eax
ret
On newer CPUs you can use the much faster lzcnt instruction (a.k.a. rep bsr). lzcnt does its job in a single cycle.
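With GCC or Clang you would normally reach this instruction through a builtin rather than inline assembly; a sketch (undefined for n == 0, and it overflows for n >= 2^31, since the shift amount reaches 32):
unsigned int nextPow2(unsigned int n)
{
    // __builtin_clz counts leading zero bits, so 31 - clz(n) is the MSB
    // index; shifting 1 one position past the MSB gives k > n with k = 2^i.
    return 1u << (32 - __builtin_clz(n));
}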
A more mathematical way, without loops:
public static int ByLogs(int n)
{
    double y = Math.Floor(Math.Log(n, 2));
    return (int)Math.Pow(2, y + 1);
}
Here's a logic answer:
function getK(int n)
{
    int k = 1;
    while (k < n)
        k *= 2;
    return k;
}
Here's John Feminella's answer implemented as a loop so it can handle Python's long integers:
def next_power_of_2(n):
    """
    Return next power of 2 greater than or equal to n
    """
    n -= 1  # greater than OR EQUAL TO n
    shift = 1
    while (n + 1) & n:  # n+1 is not a power of 2 yet
        n |= n >> shift
        shift <<= 1
    return n + 1
It also returns faster if n is already a power of 2.
For Python 2.7 and later, this is simpler and faster for most N:
def next_power_of_2(n):
    """
    Return next power of 2 greater than or equal to n
    """
    return 2**(n - 1).bit_length()
This answer is based on constexpr, to prevent any computation at runtime when the function argument is known at compile time.
Greater than / Greater than or equal to
The following snippets are for the next number k > n such that k = 2^i
(n=123 => k=128, n=128 => k=256) as specified by OP.
If you want the smallest power of 2 greater than OR equal to n then just replace __builtin_clzll(n) by __builtin_clzll(n-1) in the following snippets.
C++11 using GCC or Clang (64 bits)
#include <cstdint> // uint64_t

constexpr uint64_t nextPowerOfTwo64 (uint64_t n)
{
    return 1ULL << (sizeof(uint64_t) * 8 - __builtin_clzll(n));
}
Enhancement using CHAR_BIT as proposed by martinec
#include <cstdint>
#include <climits> // CHAR_BIT

constexpr uint64_t nextPowerOfTwo64 (uint64_t n)
{
    return 1ULL << (sizeof(uint64_t) * CHAR_BIT - __builtin_clzll(n));
}
C++17 using GCC or Clang (from 8 to 128 bits)
#include <cstdint>
#include <climits> // CHAR_BIT

template <typename T>
constexpr T nextPowerOfTwo64 (T n)
{
    T clz = 0;
    if constexpr (sizeof(T) <= 4)       // up to 32 bits (sizeof counts bytes, not bits)
        clz = __builtin_clz(n) - (32 - CHAR_BIT * sizeof(T)); // unsigned int; adjust for integer promotion
    else if constexpr (sizeof(T) <= 8)  // up to 64 bits
        clz = __builtin_clzll(n);       // unsigned long long
    else {                              // 128 bits; see https://stackoverflow.com/a/40528716
        uint64_t hi = n >> 64;
        uint64_t lo = (hi == 0) ? n : -1ULL;
        clz = _lzcnt_u64(hi) + _lzcnt_u64(lo);
    }
    return T{1} << (CHAR_BIT * sizeof(T) - clz);
}
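For example, since these functions are constexpr on GCC and Clang, the 64-bit version above can be sanity-checked at compile time (a small sketch):
static_assert(nextPowerOfTwo64(uint64_t{123}) == 128, "n = 123 => k = 128");
static_assert(nextPowerOfTwo64(uint64_t{128}) == 256, "n = 128 => k = 256");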
Other compilers
If you use a compiler other than GCC or Clang, please visit the Wikipedia page listing the Count Leading Zeroes bitwise functions:
Visual C++ 2005 => Replace __builtin_clzl() by _BitScanReverse()
Visual C++ 2008 => Replace __builtin_clzl() by __lzcnt()
icc => Replace __builtin_clzl() by _bit_scan_reverse()
GHC (Haskell) => Replace __builtin_clzl() by countLeadingZeros()
Contribution welcome
Please propose improvements within the comments. Also propose alternative for the compiler you use, or your programming language...
See also similar answers
nulleight's answer
ydroneaud's answer
Here's a wild one that has no loops, but uses an intermediate float.
// compute k = nextpowerof2(n)
if (n > 1)
{
    float f = (float) n;
    unsigned int const t = 1U << ((*(unsigned int *)&f >> 23) - 0x7f);
    k = t << (t < n);
}
else k = 1;
This, and many other bit-twiddling hacks, including the one submitted by John Feminella, can be found here.
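As an aside, dereferencing the casted pointer (*(unsigned int *)&f) breaks C++'s strict-aliasing rules; a memcpy performs the same bit inspection portably (a sketch, with the same float-precision caveats as the original):
#include <cstring>

unsigned int next_pow2_float(unsigned int n)
{
    if (n <= 1) return 1;
    float f = (float)n;
    unsigned int bits;
    std::memcpy(&bits, &f, sizeof bits);          // portable type punning
    unsigned int t = 1u << ((bits >> 23) - 0x7f); // power of two <= n
    return t << (t < n);                          // double it if t fell short
}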
Assume x is not negative.
int pot = Integer.highestOneBit(x);
if (pot != x) {
    pot *= 2;
}
If you use GCC, MinGW or Clang:
template <typename T>
T nextPow2(T in)
{
    return (in & (T)(in - 1)) ? (1U << (sizeof(T) * 8 - __builtin_clz(in))) : in;
}
If you use Microsoft Visual C++, use _BitScanReverse() in place of __builtin_clz().
function Pow2Thing(int n)
{
    x = 1;
    while (n > 0)
    {
        n /= 2;
        x *= 2;
    }
    return x;
}
Bit-twiddling, you say?
long int pow_2_ceil(long int t) {
    if (t == 0) return 1;
    if (t != (t & -t)) {
        do {
            t -= t & -t;
        } while (t != (t & -t));
        t <<= 1;
    }
    return t;
}
Each loop iteration strips the least-significant 1-bit directly. N.B. This only works where signed numbers are encoded in two's complement.
What about something like this:
int pot = 1;
for (int i = 0; i < 31; i++, pot <<= 1)
    if (pot >= x)
        break;
You just need to find the most significant bit and shift it left once. Here's a Python implementation. I think x86 has an instruction to get the MSB, but here I'm implementing it all in straight Python. Once you have the MSB it's easy.
>>> def msb(n):
... result = -1
... index = 0
... while n:
... bit = 1 << index
... if bit & n:
... result = index
... n &= ~bit
... index += 1
... return result
...
>>> def next_pow(n):
... return 1 << (msb(n) + 1)
...
>>> next_pow(1)
2
>>> next_pow(2)
4
>>> next_pow(3)
4
>>> next_pow(4)
8
>>> next_pow(123)
128
>>> next_pow(222)
256
>>>
Forget this! It uses a loop!
unsigned int nextPowerOf2 ( unsigned int u)
{
    unsigned int v = 0x80000000; // assumes a 32-bit unsigned int
    if (u < v) {
        while (v > u) v = v >> 1;
    }
    return (v << 1); // returns 0 if the number is too big
}
private static int nextHighestPower(int number) {
    if ((number & number - 1) == 0) {
        return number;
    } else {
        int count = 0;
        while (number != 0) {
            number = number >> 1;
            count++;
        }
        return 1 << count;
    }
}
// n is the number
int min = (n & -n);           // the lowest set bit of n
int nextPowerOfTwo = n + min; // caveat: only a power of two when the set bits
                              // of n are contiguous (works for 0b110, but
                              // gives 6, not 8, for 0b101)
#define nextPowerOf2(x, n) (((x) + ((n) - 1)) & ~((n) - 1))
or even
#define nextPowerOf2(x, n) ((x) + ((-(x)) & ((n) - 1)))
(Note that, despite the name, both of these round x up to the next multiple of n, where n must itself be a power of two; they do not compute the next power of two.)