cuda fortran cufftPlanMany - cuda

I am having troubles using cufftPlanMany. After creating the plans and taking the forward and inverse FFTs, I could not get the original data back. Please, find attached a minimum version of the code.
program test_cufft
use cudafor
use cufft
integer :: plan_r2c
integer :: plan_c2r
real,allocatable,dimension(:,:,:,:), device :: eta_d
complex,allocatable,dimension(:,:,:,:), device :: etak_d
nv = 4
nx = 256
ny = 512
nz = 512
nx21 = nx/2+1
allocate( eta_d(nv,nx,ny,nz) )
allocate( etak_d(nv,nx21,ny,nz) )
batch = nv;
rank = 3;
n = (/ nx, ny, nz /);
idist = nx*ny*nz;
odist = nx21*ny*nz;
inembed = (/ nx, ny, nz /);
onembed = (/ nx21, ny, nz /);
istride = 1;
ostride = 1;
istat = cufftPlanMany( plan_r2c, rank, n, inembed, istride, idist, &
onembed, ostride, odist, CUFFT_R2C, batch )
istat = cufftPlanMany( plan_c2r, rank, n, onembed, ostride, odist, &
inembed, istride, idist, CUFFT_C2R, batch )
! Initialize eta_d
istat = cufftExecR2C( plan_r2c, eta_d, etak_d )
istat = cufftExecC2R( plan_c2r, etak_d, eta_d )
eta_d = eta_d/idist
end program test_cufft
The problem is after I did the forward and inverse FFTs, I could not get the original data back.
Please, what am I doing wrong?
Should the ordering of data be eta_d(batch,nx,ny,nz) or eta_d(nx,ny,nz,batch)?

I would say the correct ordering is (nz, ny, nx, batch)
But it's important to relate these to your array indexing and storage order as well.
In CUFFT terminology, for a 3D transform(*) the nz direction is the fastest changing index, with typical usage (stride=1) being adjacent data in memory, corresponding to adjacent elements in a transform.
This direction (I think of it as the elements along a row, i.e. the "z" index being the column index) is also the direction of a multidimensional transform that gets "reduced" in the complex domain, for the R2C/C2R transform types.
With that in mind, I would rewrite your code this way:
$ cat t4.cuf
program test_cufft
use cudafor
use cufft
integer :: plan_r2c
integer :: plan_c2r
real,allocatable,dimension(:,:,:,:), managed :: eta_d
complex,allocatable,dimension(:,:,:,:), managed :: etak_d
integer :: n(3), inembed(3), onembed(3),rank,istride,idist,ostride,odist,batch
nv = 4
nx = 8
ny = 8
nz = 4
nz21 = nz/2+1
allocate( eta_d(nz,ny,nx,nv) )
allocate( etak_d(nz21,ny,nx,nv) )
batch = nv;
rank = 3;
n = (/ nx, ny, nz /);
idist = nx*ny*nz;
odist = nx*ny*nz21;
inembed = (/ nx, ny, nz /);
onembed = (/ nx, ny, nz21 /);
istride = 1;
ostride = 1;
istat = cufftPlanMany( plan_r2c, rank, n, inembed, istride, idist, onembed, ostride, odist, CUFFT_R2C, batch )
istat = cufftPlanMany( plan_c2r, rank, n, onembed, ostride, odist, inembed, istride, idist, CUFFT_C2R, batch )
! Initialize eta_d
eta_d(:,:,:,:) = 1.0
eta_d(1,1,1,2) = 2.0
istat = cufftExecR2C( plan_r2c, eta_d, etak_d )
istat = cudaDeviceSynchronize()
eta_d(:,:,:,:) = 0.0
istat = cufftExecC2R( plan_c2r, etak_d, eta_d )
istat = cudaDeviceSynchronize()
eta_d = eta_d/idist
print *,eta_d(1,1,1,1)
print *,eta_d(1,1,1,2)
end program test_cufft
$ nvfortran t4.cuf -lcufft
$ ./a.out
1.000000
2.000000
$
(NVIDIA HPC SDK 20.9, Tesla V100 GPU)
and it seems to give expected results for my simple test case.
(*) For a 2D transform, the ny dimension would be the most rapidly varying, and for a 1D transform the nx dimension (of course) is the most rapidly varying.
The multi-dimensional transforms and advanced data layout sections of the CUFFT manual may also be useful reading.

Related

Julia CUDA - Reduce matrix columns

Consider the following kernel, which reduces along the rows of a 2-D matrix
function row_sum!(x, ncol, out)
"""out = sum(x, dims=2)"""
row_idx = (blockIdx().x-1) * blockDim().x + threadIdx().x
for i = 1:ncol
#inbounds out[row_idx] += x[row_idx, i]
end
return
end
N = 1024
x = CUDA.rand(Float64, N, 2*N)
out = CUDA.zeros(Float64, N)
#cuda threads=256 blocks=4 row_sum!(x, size(x)[2], out)
isapprox(out, sum(x, dims=2)) # true
How do I write a similar kernel except for reducing along the columns (of a 2-D matrix)? In particular, how do I get the index of each column, similar to how we got the index of each row with row_idx?
Here is the code:
function col_sum!(x, nrow, out)
"""out = sum(x, dims=1)"""
col_idx = (blockIdx().x-1) * blockDim().x + threadIdx().x
for i = 1:nrow
#inbounds out[col_idx] += x[i, col_idx]
end
return
end
N = 1024
x = CUDA.rand(Float64, N, 2N)
out = CUDA.zeros(Float64, 2N)
#cuda threads=256 blocks=8 col_sum!(x, size(x, 1), out)
And here is the test:
julia> isapprox(out, vec(sum(x, dims=1)))
true
As you can see the size of the result vector is now 2N instead of N, hence we had to adapt the number of blocks accordingly (that is multiply by 2 and now we have 8 instead of 4)
More materials can be found here: https://juliagpu.gitlab.io/CUDA.jl/tutorials/introduction/

CUBLAS Sgemm confusing results

For two matrices X and Q of size 4x3 and 2x3
which in memory look like
x = [0 1 2 3 4 5 6 7 8 9 10 11]
q = [3 4 5 6 7 8]
I tried to use cublas multiplication cublasSgemm, but I couldn't manage to get expected results.
Since they are stored in row-major order so they should be interpreted as 3x4 and 3x2 so it seemed for me that
cublasSgemm(cublas_handle,
CUBLAS_OP_T, CUBLAS_OP_N,
q_rows_num, x_rows_num, dim,
&alpha, // 1
q_device, q_rows_num,
x, x_rows_num,
&beta, // 0
x_q_multiplication, q_rows_num);
where
dim = 3
x_rows_num = 4
q_rows_num = 2
would work but in that case I got error
** On entry to SGEMM parameter number 8 had an illegal value
I also tried shuffling parameters a bit but I couldn't find any setup that would work.
So is it possible to multiply them without changing to column-major order?
EDIT:
So I got exepected results with changes made in this working example:
#include <cublas_v2.h>
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>
int main()
{
int x_rows_num = 4;
int q_rows_num = 2;
int dim = 3;
int N = x_rows_num*dim;
int M = q_rows_num*dim;
float *x, *q, *x_q_multiplication;
cudaMallocManaged(&x, N*sizeof(float));
cudaMallocManaged(&q, M*sizeof(float));
cudaMallocManaged(&x_q_multiplication, q_rows_num*x_rows_num*dim);
for (int i = 0; i< N; i++) x[i] = i*1.0f;
for (int i = 0; i< M; i++) q[i] = (i + 3)*1.0f;
float *q_device;
cudaMallocManaged(&q_device, M*sizeof(float));
cudaMemcpy(q_device, q, M*sizeof(float), cudaMemcpyHostToDevice);
cublasHandle_t handle;
cublasCreate(&handle);
float alpha = 1.f;
float beta = 0.f;
cublasSgemm(handle,
CUBLAS_OP_T, CUBLAS_OP_N,
x_rows_num, q_rows_num, dim,
&alpha,
x, dim,
q, dim,
&beta,
x_q_multiplication, x_rows_num);
cudaDeviceSynchronize();
for (int i = 0; i < q_rows_num*x_rows_num; i++) std::cout << x_q_multiplication[i] << " ";
cudaFree(x);
cudaFree(q);
cudaFree(x_q_multiplication);
return 0;
}
However I'am still not sure why dim became leading dimension
Your original CUBLAS call:
cublasSgemm(cublas_handle,
CUBLAS_OP_T, CUBLAS_OP_N,
q_rows_num, x_rows_num, dim,
&alpha, // 1
q_device, q_rows_num,
x, x_rows_num,
&beta, // 0
x_q_multiplication, q_rows_num);
was close to correct. Your interpretation of what the leading dimensions should be was correct. What you got wrong was the Op specifiers. If both matrices are row major ordered and the first array needs to be read in its (row major) transposed order, then the operation should be:
#include <cublas_v2.h>
#include <cstring>
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>
int main()
{
int x_rows_num = 4;
int q_rows_num = 2;
int dim = 3;
int N = x_rows_num*dim;
int M = q_rows_num*dim;
float x0[12] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
float q0[6] = {3, 4, 5, 6, 7, 8 };
float *x, *q, *x_q_multiplication;
cudaMallocManaged(&x, N*sizeof(float));
cudaMallocManaged(&q, M*sizeof(float));
cudaMallocManaged(&x_q_multiplication, q_rows_num*x_rows_num*dim);
std::memcpy(x, x0, N*sizeof(float));
std::memcpy(q, q0, M*sizeof(float));
float *q_device;
cudaMallocManaged(&q_device, M*sizeof(float));
cudaMemcpy(q_device, q, M*sizeof(float), cudaMemcpyHostToDevice);
cublasHandle_t handle;
cublasCreate(&handle);
float alpha = 1.f;
float beta = 0.f;
cublasSgemm(handle,
CUBLAS_OP_N, CUBLAS_OP_T,
q_rows_num, x_rows_num, dim,
&alpha, // 1
q_device, q_rows_num,
x, x_rows_num,
&beta, // 0
x_q_multiplication, q_rows_num);
cudaDeviceSynchronize();
for (int i = 0; i < q_rows_num*x_rows_num; i++) std::cout << x_q_multiplication[i] << " "; std::cout << std::endl;
cudaFree(x);
cudaFree(q);
cudaFree(x_q_multiplication);
return 0;
}
which does this for me:
$ nvcc -arch=sm_52 cublas_trans.cu -o cublas_trans -lcublas
$ ./cublas_trans
76 88 91 106 106 124 121 142
and which I believe is the correct answer.
Incidentally, Robert Crovella's now deleted comment, which you say you take offense to was 100% correct. I suspect he read, as I did, your original CUBLAS call, interpreted the arguments and concluded, as I did, and as CUBLAS itself did, that you are trying to multiply a 3x4 matrix and a 3x2 matrix. Which is why the invalid argument error was raised.

cuFFT wrong results only when starting from complex

I was helped before in this answer to realise an in-place transform and it works well but ONLY if I start with real data. If I start with complex data, the results after IFT+FFT are wrong, and this happens only in the in-place version, I have perfect results with an out-of-place version of this transform.
This is the code:
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <complex.h>
#include <cuComplex.h>
#include <cufft.h>
#include <cufftXt.h>
#define N 4
#define N_PAD ( 2*(N/2+1) )
void print_3D_Real(double *array){
printf("\nPrinting 3D real matrix \n");
unsigned long int idx;
for (int z = 0; z < N; z++){
printf("---------------------------------------------------------------------------- plane %d below\n", z);
for (int x = 0; x < N; x++){
for (int y = 0; y < N; y++){
idx = z + N_PAD * (y + x * N);
printf("%.3f \t", array[idx]);
}
printf("\n");
}
}
}
void print_3D_Comp(cuDoubleComplex *array){
printf("\nPrinting 3D complex matrix \n");
unsigned long int idx;
for (int z = 0; z < (N/2+1); z++){
printf("---------------------------------------------------------------------------- plane %d below\n", z);
for (int x = 0; x < N; x++){
for (int y = 0; y < N; y++){
idx = z + (N/2+1) * (y + x * N);
printf("%+.3f%+.3fi \t", array[idx].x, array[idx].y);
}
printf("\n");
}
}
}
// Main function
int main(int argc, char **argv){
CU_ERR_CHECK( cudaSetDevice(0) );
unsigned long int idx, in_mem_size, out_mem_size;
cuDoubleComplex *in = NULL, *d_in = NULL;
double *out = NULL, *d_out = NULL;
cufftHandle plan_r2c, plan_c2r;
in_mem_size = sizeof(cuDoubleComplex) * N*N*(N/2+1);
out_mem_size = in_mem_size;
in = (cuDoubleComplex *) malloc (in_mem_size);
out = (double *) in;
cudaMalloc((void **)&d_in, in_mem_size);
d_out = (double *) d_in;
cufftPlan3d(&plan_c2r, N, N, N, CUFFT_Z2D);
cufftPlan3d(&plan_r2c, N, N, N, CUFFT_D2Z);
memset(in, 0, in_mem_size);
memset(out, 0, out_mem_size);
// Initial complex data
for (int x = 0; x < N; x++){
for (int y = 0; y < N; y++){
for (int z = 0; z < (N/2+1); z++){
idx = z + (N/2+1) * (y + x * N);
in[idx].x = idx;
}
}
}
print_3D_Comp(in);
cudaMemcpy(d_in, in, in_mem_size, cudaMemcpyHostToDevice);
cufftExecZ2D(plan_c2r, (cufftDoubleComplex *)d_in, (cufftDoubleReal *)d_out);
cudaMemcpy(out, d_out, out_mem_size, cudaMemcpyDeviceToHost);
// Normalisation
for (int i = 0; i < N*N*N_PAD; i++)
out[i] /= (N*N*N);
print_3D_Real(out);
cudaMemcpy(d_out, out, out_mem_size, cudaMemcpyHostToDevice);
cufftExecD2Z(plan_r2c, (cufftDoubleReal *)d_out, (cufftDoubleComplex *)d_in);
cudaMemcpy(in, d_in, in_mem_size, cudaMemcpyDeviceToHost) );
print_3D_Comp(in);
cudaDeviceReset();
return 0;
}
The output of my program is on this pastebin.
Can someone direct me on the right path? Thank you very much in advance.
First of all, your code doesn't compile.
In its most general definition, the fourier transform performs a mapping from one complex domain to another complex domain, and this operation should be reversible.
However, the C2R and R2C are special cases, with an assumption that the signal is completely representable in one of the 2 domains (the "time" domain) as a purely real signal (all imaginary components are zero).
However, it should be evident that there will be some complex "frequency" domain representations that cannot be represented by a purely real time domain signal. If the counter case were true (any complex frequency domain signal can be represented as a purely real time domain signal) then the FFT could not be reversible for a complex time domain signal (since all frequency domain data sets map to purely real time domain data sets.)
Therefore you cannot choose arbitrary data in the frequency domain, and expect it to map correctly into a purely real time domain signal. (*)
As a demonstration, change your input data set to the following:
in[idx].x = (idx)?0:1;
and I believe you will get a "passing" test case.
Furthermore, your allegation that "I have perfect results with an out-of-place version of this transform" I believe cannot be supported, if you are in fact using this particular data set as posted in your question. If you disagree, please post a complete code demonstrating your passing test case with the out-of-place transform, that is otherwise identical to your posted code.
Finally, we can test this with fftw. A conversion of your program to use fftw instead of cufft produces exactly the same output:
$ cat t355.cpp
#include <stdio.h>
#include <stdlib.h>
#include <fftw3.h>
#include <string.h>
#define N 4
#define N_PAD ( 2*(N/2+1) )
void print_3D_Real(double *array){
printf("\nPrinting 3D real matrix \n");
unsigned long int idx;
for (int z = 0; z < N; z++){
printf("---------------------------------------------------------------------------- plane %d below\n", z);
for (int x = 0; x < N; x++){
for (int y = 0; y < N; y++){
idx = z + N_PAD * (y + x * N);
printf("%.3f \t", array[idx]);
}
printf("\n");
}
}
}
void print_3D_Comp(fftw_complex *array){
printf("\nPrinting 3D complex matrix \n");
unsigned long int idx;
for (int z = 0; z < (N/2+1); z++){
printf("---------------------------------------------------------------------------- plane %d below\n", z);
for (int x = 0; x < N; x++){
for (int y = 0; y < N; y++){
idx = z + (N/2+1) * (y + x * N);
printf("%+.3f%+.3fi \t", array[idx][0], array[idx][1]);
}
printf("\n");
}
}
}
// Main function
int main(int argc, char **argv){
unsigned long int idx, in_mem_size, out_mem_size;
fftw_complex *in = NULL;
double *out = NULL;
in_mem_size = sizeof(fftw_complex) * N*N*(N/2+1);
out_mem_size = in_mem_size;
in = (fftw_complex *) malloc (in_mem_size);
out = (double *) in;
memset(in, 0, in_mem_size);
memset(out, 0, out_mem_size);
// Initial complex data
for (int x = 0; x < N; x++){
for (int y = 0; y < N; y++){
for (int z = 0; z < (N/2+1); z++){
idx = z + (N/2+1) * (y + x * N);
in[idx][0] = idx;
}
}
}
print_3D_Comp(in);
fftw_plan plan_c2r = fftw_plan_dft_c2r_3d(N, N, N, in, out, FFTW_ESTIMATE);
fftw_plan plan_r2c = fftw_plan_dft_r2c_3d(N, N, N, out, in, FFTW_ESTIMATE);
fftw_execute(plan_c2r);
// Normalisation
for (int i = 0; i < N*N*N_PAD; i++)
out[i] /= (N*N*N);
print_3D_Real(out);
fftw_execute(plan_r2c);
print_3D_Comp(in);
return 0;
}
$ g++ t355.cpp -o t355 -lfftw3
$ ./t355
Printing 3D complex matrix
---------------------------------------------------------------------------- plane 0 below
+0.000+0.000i +3.000+0.000i +6.000+0.000i +9.000+0.000i
+12.000+0.000i +15.000+0.000i +18.000+0.000i +21.000+0.000i
+24.000+0.000i +27.000+0.000i +30.000+0.000i +33.000+0.000i
+36.000+0.000i +39.000+0.000i +42.000+0.000i +45.000+0.000i
---------------------------------------------------------------------------- plane 1 below
+1.000+0.000i +4.000+0.000i +7.000+0.000i +10.000+0.000i
+13.000+0.000i +16.000+0.000i +19.000+0.000i +22.000+0.000i
+25.000+0.000i +28.000+0.000i +31.000+0.000i +34.000+0.000i
+37.000+0.000i +40.000+0.000i +43.000+0.000i +46.000+0.000i
---------------------------------------------------------------------------- plane 2 below
+2.000+0.000i +5.000+0.000i +8.000+0.000i +11.000+0.000i
+14.000+0.000i +17.000+0.000i +20.000+0.000i +23.000+0.000i
+26.000+0.000i +29.000+0.000i +32.000+0.000i +35.000+0.000i
+38.000+0.000i +41.000+0.000i +44.000+0.000i +47.000+0.000i
Printing 3D real matrix
---------------------------------------------------------------------------- plane 0 below
23.500 -1.500 -1.500 -1.500
-6.000 0.000 0.000 0.000
-6.000 0.000 0.000 0.000
-6.000 0.000 0.000 0.000
---------------------------------------------------------------------------- plane 1 below
-0.500 0.750 0.000 -0.750
3.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000
-3.000 0.000 0.000 0.000
---------------------------------------------------------------------------- plane 2 below
0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000
---------------------------------------------------------------------------- plane 3 below
-0.500 -0.750 0.000 0.750
-3.000 0.000 0.000 0.000
0.000 0.000 0.000 0.000
3.000 0.000 0.000 0.000
Printing 3D complex matrix
---------------------------------------------------------------------------- plane 0 below
+0.000+0.000i +6.000+0.000i +6.000+0.000i +6.000+0.000i
+24.000+0.000i +30.000+0.000i +30.000+0.000i +30.000+0.000i
+24.000+0.000i +30.000+0.000i +30.000+0.000i +30.000+0.000i
+24.000+0.000i +30.000+0.000i +30.000+0.000i +30.000+0.000i
---------------------------------------------------------------------------- plane 1 below
+1.000+0.000i +4.000+0.000i +7.000+0.000i +10.000+0.000i
+13.000+0.000i +16.000+0.000i +19.000+0.000i +22.000+0.000i
+25.000+0.000i +28.000+0.000i +31.000+0.000i +34.000+0.000i
+37.000+0.000i +40.000+0.000i +43.000+0.000i +46.000+0.000i
---------------------------------------------------------------------------- plane 2 below
+2.000+0.000i +8.000+0.000i +8.000+0.000i +8.000+0.000i
+26.000+0.000i +32.000+0.000i +32.000+0.000i +32.000+0.000i
+26.000+0.000i +32.000+0.000i +32.000+0.000i +32.000+0.000i
+26.000+0.000i +32.000+0.000i +32.000+0.000i +32.000+0.000i
$
(*) You can argue, if you wish, that that the complex-conjugate symmetry feature of the C2R and R2C transforms should account for a correct mapping of all possible complex "frequency" domain signals into unique, purely real "time" domain signals. I claim, without proof, that it does not, with 2 data points:
The example code in this question.
Since the complex space in a C2R or R2C transform is numerically larger than the real space (by a factor of (2*(N/2+1))/N), it stands to reason that there cannot be a unique 1:1 mapping of all possible complex signals into unique real signals. And the unique 1:1 mapping would be necessary for full reversibility.
For additional background on the possibility of lack of symmetry in random data, note the discussion around CUFFT_COMPATIBILITY_FFTW_ASYMMETRIC in the cufft documentation.

Transpose matrix multiplication in cuBLAS howto

The problem is simple: I have two matrices, A and B, that are M by N, where M >> N. I want to first take the transpose of A, and then multiply that by B (A^T * B) to put that into C, which is N by N. I have everything set up for A and B, but how do I call cublasSgemm properly without it returning the wrong answer?
I understand that cuBlas has a cublasOperation_t enum for transposing things beforehand, but somehow I'm not quite using it correctly. My matrices A and B are in row-major order, i.e. [ row1 ][ row2 ][ row3 ]..... in device memory. That means for A to be interpreted as A-transposed, BLAS needs to know my A is in column-major order. My current code looks like below:
float *A, *B, *C;
// initialize A, B, C as device arrays, fill them with values
// initialize m = num_row_A, n = num_row_B, and k = num_col_A;
// set lda = m, ldb = k, ldc = m;
// alpha = 1, beta = 0;
// set up cuBlas handle ...
cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k, &alpha, A, lda, B, ldb, &beta, C, ldc);
My questions:
Am I setting up m, k, n correctly?
What about lda, ldb, ldc?
Thanks!
Since cuBLAS always assume that the matrices are stored in column-major, you could either transpose your matrices first into colum-major by using cublas_geam(), or you could treat your matrix A, stored in row-major, as a new matrix AT stored in column-major. The matrix AT is actually the transpose of A. For B do the same thing. Then you could calculate matrix C stored in column-major by C=AT * BT^T
float* AT = A;
float* BT = B;
The leading dimension is a param related to the storage, which doesn't change no matter if you use the transpose flag CUBLAS_OP_T or not.
lda = num_col_A = num_row_AT = N;
ldb = num_col_B = num_row_BT = N;
ldc = num_row_C = N;
m and n in the cuBLAS GEMM routine are the #rows and #cols of the result matrix C,
m = num_row_C = num_row_AT = num_col_A = N;
n = num_col_C = num_row_BT = num_col_B = N;
k is the common dimension of A^T and B,
k = num_col_AT = num_row_B = M;
Then you could invoke the GEMM routine by
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, m, n, k, &alpha, AT, lda, BT, ldb, &beta, C, ldc);
If you want the matrix C to be stored in row-major, you could calculate the CT stored in column-major with the formula CT = BT * AT^T by
cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_T, n, m, k, &alpha, BT, ldb, AT, lda, &beta, CT, ldc);
Please note you don't have to swap m and n since C is a square matrix in this case.

Given an integer, how do I find the next largest power of two using bit-twiddling?

If I have a integer number n, how can I find the next number k > n such that k = 2^i, with some i element of N by bitwise shifting or logic.
Example: If I have n = 123, how can I find k = 128, which is a power of two, and not 124 which is only divisible by two. This should be simple, but it eludes me.
For 32-bit integers, this is a simple and straightforward route:
unsigned int n;
n--;
n |= n >> 1; // Divide by 2^k for consecutive doublings of k up to 32,
n |= n >> 2; // and then or the results.
n |= n >> 4;
n |= n >> 8;
n |= n >> 16;
n++; // The result is a number of 1 bits equal to the number
// of bits in the original number, plus 1. That's the
// next highest power of 2.
Here's a more concrete example. Let's take the number 221, which is 11011101 in binary:
n--; // 1101 1101 --> 1101 1100
n |= n >> 1; // 1101 1100 | 0110 1110 = 1111 1110
n |= n >> 2; // 1111 1110 | 0011 1111 = 1111 1111
n |= n >> 4; // ...
n |= n >> 8;
n |= n >> 16; // 1111 1111 | 1111 1111 = 1111 1111
n++; // 1111 1111 --> 1 0000 0000
There's one bit in the ninth position, which represents 2^8, or 256, which is indeed the next largest power of 2. Each of the shifts overlaps all of the existing 1 bits in the number with some of the previously untouched zeroes, eventually producing a number of 1 bits equal to the number of bits in the original number. Adding one to that value produces a new power of 2.
Another example; we'll use 131, which is 10000011 in binary:
n--; // 1000 0011 --> 1000 0010
n |= n >> 1; // 1000 0010 | 0100 0001 = 1100 0011
n |= n >> 2; // 1100 0011 | 0011 0000 = 1111 0011
n |= n >> 4; // 1111 0011 | 0000 1111 = 1111 1111
n |= n >> 8; // ... (At this point all bits are 1, so further bitwise-or
n |= n >> 16; // operations produce no effect.)
n++; // 1111 1111 --> 1 0000 0000
And indeed, 256 is the next highest power of 2 from 131.
If the number of bits used to represent the integer is itself a power of 2, you can continue to extend this technique efficiently and indefinitely (for example, add a n >> 32 line for 64-bit integers).
There is actually a assembly solution for this (since the 80386 instruction set).
You can use the BSR (Bit Scan Reverse) instruction to scan for the most significant bit in your integer.
bsr scans the bits, starting at the
most significant bit, in the
doubleword operand or the second word.
If the bits are all zero, ZF is
cleared. Otherwise, ZF is set and the
bit index of the first set bit found,
while scanning in the reverse
direction, is loaded into the
destination register
(Extracted from: http://dlc.sun.com/pdf/802-1948/802-1948.pdf)
And than inc the result with 1.
so:
bsr ecx, eax //eax = number
jz #zero
mov eax, 2 // result set the second bit (instead of a inc ecx)
shl eax, ecx // and move it ecx times to the left
ret // result is in eax
#zero:
xor eax, eax
ret
In newer CPU's you can use the much faster lzcnt instruction (aka rep bsr). lzcnt does its job in a single cycle.
A more mathematical way, without loops:
public static int ByLogs(int n)
{
double y = Math.Floor(Math.Log(n, 2));
return (int)Math.Pow(2, y + 1);
}
Here's a logic answer:
function getK(int n)
{
int k = 1;
while (k < n)
k *= 2;
return k;
}
Here's John Feminella's answer implemented as a loop so it can handle Python's long integers:
def next_power_of_2(n):
"""
Return next power of 2 greater than or equal to n
"""
n -= 1 # greater than OR EQUAL TO n
shift = 1
while (n+1) & n: # n+1 is not a power of 2 yet
n |= n >> shift
shift <<= 1
return n + 1
It also returns faster if n is already a power of 2.
For Python >2.7, this is simpler and faster for most N:
def next_power_of_2(n):
"""
Return next power of 2 greater than or equal to n
"""
return 2**(n-1).bit_length()
This answer is based on constexpr to prevent any computing at runtime when the function parameter is passed as const
Greater than / Greater than or equal to
The following snippets are for the next number k > n such that k = 2^i
(n=123 => k=128, n=128 => k=256) as specified by OP.
If you want the smallest power of 2 greater than OR equal to n then just replace __builtin_clzll(n) by __builtin_clzll(n-1) in the following snippets.
C++11 using GCC or Clang (64 bits)
#include <cstdint> // uint64_t
constexpr uint64_t nextPowerOfTwo64 (uint64_t n)
{
return 1ULL << (sizeof(uint64_t) * 8 - __builtin_clzll(n));
}
Enhancement using CHAR_BIT as proposed by martinec
#include <cstdint>
constexpr uint64_t nextPowerOfTwo64 (uint64_t n)
{
return 1ULL << (sizeof(uint64_t) * CHAR_BIT - __builtin_clzll(n));
}
C++17 using GCC or Clang (from 8 to 128 bits)
#include <cstdint>
template <typename T>
constexpr T nextPowerOfTwo64 (T n)
{
T clz = 0;
if constexpr (sizeof(T) <= 32)
clz = __builtin_clzl(n); // unsigned long
else if (sizeof(T) <= 64)
clz = __builtin_clzll(n); // unsigned long long
else { // See https://stackoverflow.com/a/40528716
uint64_t hi = n >> 64;
uint64_t lo = (hi == 0) ? n : -1ULL;
clz = _lzcnt_u64(hi) + _lzcnt_u64(lo);
}
return T{1} << (CHAR_BIT * sizeof(T) - clz);
}
Other compilers
If you use a compiler other than GCC or Clang, please visit the Wikipedia page listing the Count Leading Zeroes bitwise functions:
Visual C++ 2005 => Replace __builtin_clzl() by _BitScanForward()
Visual C++ 2008 => Replace __builtin_clzl() by __lzcnt()
icc => Replace __builtin_clzl() by _bit_scan_forward
GHC (Haskell) => Replace __builtin_clzl() by countLeadingZeros()
Contribution welcome
Please propose improvements within the comments. Also propose alternative for the compiler you use, or your programming language...
See also similar answers
nulleight's answer
ydroneaud's answer
Here's a wild one that has no loops, but uses an intermediate float.
// compute k = nextpowerof2(n)
if (n > 1)
{
float f = (float) n;
unsigned int const t = 1U << ((*(unsigned int *)&f >> 23) - 0x7f);
k = t << (t < n);
}
else k = 1;
This, and many other bit-twiddling hacks, including the on submitted by John Feminella, can be found here.
assume x is not negative.
int pot = Integer.highestOneBit(x);
if (pot != x) {
pot *= 2;
}
If you use GCC, MinGW or Clang:
template <typename T>
T nextPow2(T in)
{
return (in & (T)(in - 1)) ? (1U << (sizeof(T) * 8 - __builtin_clz(in))) : in;
}
If you use Microsoft Visual C++, use function _BitScanForward() to replace __builtin_clz().
function Pow2Thing(int n)
{
x = 1;
while (n>0)
{
n/=2;
x*=2;
}
return x;
}
Bit-twiddling, you say?
long int pow_2_ceil(long int t) {
if (t == 0) return 1;
if (t != (t & -t)) {
do {
t -= t & -t;
} while (t != (t & -t));
t <<= 1;
}
return t;
}
Each loop strips the least-significant 1-bit directly. N.B. This only works where signed numbers are encoded in two's complement.
What about something like this:
int pot = 1;
for (int i = 0; i < 31; i++, pot <<= 1)
if (pot >= x)
break;
You just need to find the most significant bit and shift it left once. Here's a Python implementation. I think x86 has an instruction to get the MSB, but here I'm implementing it all in straight Python. Once you have the MSB it's easy.
>>> def msb(n):
... result = -1
... index = 0
... while n:
... bit = 1 << index
... if bit & n:
... result = index
... n &= ~bit
... index += 1
... return result
...
>>> def next_pow(n):
... return 1 << (msb(n) + 1)
...
>>> next_pow(1)
2
>>> next_pow(2)
4
>>> next_pow(3)
4
>>> next_pow(4)
8
>>> next_pow(123)
128
>>> next_pow(222)
256
>>>
Forget this! It uses loop !
unsigned int nextPowerOf2 ( unsigned int u)
{
unsigned int v = 0x80000000; // supposed 32-bit unsigned int
if (u < v) {
while (v > u) v = v >> 1;
}
return (v << 1); // return 0 if number is too big
}
private static int nextHighestPower(int number){
if((number & number-1)==0){
return number;
}
else{
int count=0;
while(number!=0){
number=number>>1;
count++;
}
return 1<<count;
}
}
// n is the number
int min = (n&-n);
int nextPowerOfTwo = n+min;
#define nextPowerOf2(x, n) (x + (n-1)) & ~(n-1)
or even
#define nextPowerOf2(x, n) x + (x & (n-1))