Bug in the translation of a C file to a Cython file - cython

I've succeeded in converting a C file to Cython. But the two codes give me different results, and I really cannot find where the bug is.
The relevant C code of my script is the following:
double empirical_measure(int N, double K, double d, double T, double dt, FILE *store){
    /* Define variables */
    int kt;
    int kt_max = (int) ((double)T/(double)dt);
    double *xt;
    xt = (double *)malloc(N*sizeof(double));
    double bruit;
    double S = 0.0;
    double xtmp;
    double xtilde;
    double x_diff;
    double xi;
    int i;
    int j;
    int l;

    /* Initial condition */
    for(i=0; i<N; i++){
        xt[i] = rand()/((double) RAND_MAX)*2*M_PI;
    }

    /* Compute trajectories and empirical measure */
    for(kt=0; kt<kt_max; kt++){
        for(i=0; i<N; i++){
            S = 0.0;
            xi = xt[i];
            for(j=0; j<N; j++){
                x_diff = xt[j] - xi;
                S = S + sin(x_diff);
            }
            bruit = d*sqrt(dt)*gaussian();
            xtilde = xi + ((K/N)*S)*dt + bruit;
            xt[i] = fmod(xtilde, 2.0*M_PI);
        }
    }
    return 0;
}
The gaussian function is a function which gives a random number from a Normal(0,1) distribution.
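The gaussian() implementation itself isn't shown in the post; purely for reference, a typical Box-Muller draw looks like this (a stand-in sketched in Python, not the asker's actual C function):

import math
import random

def gaussian():
    # One Box-Muller draw from N(0, 1); using 1 - random() avoids log(0)
    u1 = 1.0 - random.random()
    u2 = random.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)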
I translated it into the following Cython code (where the output is a matrix holding xt[:, k] for every k, rather than, for each k, a single vector as in the C code):
def simul(int N, double K, double d, double T, double dt):
    cdef int kt_max = int(T/dt)
    cdef double S1
    cdef double xtilde, x_diff, xtmp
    cdef double[:] bruit = d*sqrt(dt)*np.random.standard_normal(N)  # bruit generator
    cdef double[:, ::1] xt = np.zeros((N, kt_max), dtype=np.float64)
    cdef int kt, i, j, k

    # initial conditions
    X = np.random.uniform(0, 2*np.pi, N)
    for i in range(N):
        xt[i, 0] = X[i]

    # compute trajectories and empirical measure
    for kt in range(kt_max-1):
        for i in range(N):
            S1 = 0.0
            for j in range(N):
                x_diff = xt[j, kt] - xt[i, kt]
                S1 = S1 + sin(x_diff)
            xtilde = xt[i, kt] + ((K/N)*S1)*dt + bruit[i]
            xt[i, kt+1] = xtilde % (2*np.pi)
    return xt
The problem is that if I run the two scripts with the same values, I get very different results. For example, given:
N=600
K=5
T=2
dt=0.01
d=1
For the last k, I obtain one histogram from the C code and a visibly different one from the Cython code (histogram images not reproduced here).
I really can't find the problem in the code. Where could a bug be present?
Update
If I run the code with d=0 (which means that the "bruit" part can be neglected) I still obtain different results:
(figure: histogram for d=0)
The blue is the C simulation and the other colors are three simulations from Python.
This means that there's something wrong in this section:
for kt in range(kt_max-1):
    for i in range(N):
        S1 = 0.0
        for j in range(N):
            x_diff = xt[j, kt] - xt[i, kt]
            S1 = S1 + sin(x_diff)
        xtilde = xt[i, kt] + ((K/N)*S1)*dt
        xt[i, kt+1] = xtilde % (2*np.pi)
Any ideas? Does the sin function have some tricky argument I should account for in the code?
I've also computed the sin sum on its own, getting:
(figure: sin sum simulations; black is the C code, the rest are Python simulations)
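One structural difference between the two listings that can be checked in isolation: the C loop overwrites xt[i] during the sweep over i, so later particles in the same time step already see updated neighbours, whereas the Cython loop reads only the previous column xt[:, kt]. Here are the two update schemes reduced to plain Python for illustration (function names are mine):

import numpy as np

def step_in_place(x, K, dt):
    # mirrors the C loop: xt[i] is overwritten during the sweep over i
    N = len(x)
    for i in range(N):
        S = np.sin(x - x[i]).sum()
        x[i] = (x[i] + (K / N) * S * dt) % (2 * np.pi)
    return x

def step_buffered(x, K, dt):
    # mirrors the Cython loop: every particle reads the old column
    N = len(x)
    new = np.empty(N)
    for i in range(N):
        S = np.sin(x - x[i]).sum()
        new[i] = (x[i] + (K / N) * S * dt) % (2 * np.pi)
    return new

As dt shrinks the two schemes converge to the same dynamics, but for a fixed dt their trajectories need not match step for step.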

Related

cython code using prange runs unexpectedly only on a single thread

I have Cython code that I am trying to parallelize using Cython's prange command. The code compiles, but when I run it, it runs only on a single thread/core. I read in other posts that in most cases this was due to the GIL not being properly released, but looking at my code I do not see where this happens. Do you have any idea what is wrong with my code?
UPDATE:
compiler: gcc 7.5
cython: 0.29.21
OS: ubuntu 20.04
Cython code:
import cython
from cython.parallel import prange
cimport numpy as cnp
import numpy as np

cdef extern from "math.h" nogil:
    double floor(double x)
    double ceil(double x)
    double sqrt(double x)

cdef inline double round(double r) nogil:
    return floor(r + 0.5) if (r > 0.0) else ceil(r - 0.5)

@cython.cdivision(True)
@cython.boundscheck(False)
@cython.wraparound(False)
cdef int atoms_in_shell_inner(double[:,:] coords, int[:,:] indexes, double[:,:] cell, double[:,:] rcell, double[:] boxed_center, double r2, int mol_idx) nogil:
    cdef double r, rx, ry, rz, sdx, sdy, sdz, x, y, z, x_boxed, y_boxed, z_boxed
    cdef int at_idx, i

    # Loop over the selected atoms j of molecule i
    for 0 <= i < indexes.shape[0]:
        if indexes[mol_idx,i] == -1:
            return 0
        at_idx = indexes[mol_idx,i]
        x = coords[at_idx,0]
        y = coords[at_idx,1]
        z = coords[at_idx,2]

        # Convert real coordinates to box coordinates
        x_boxed = x*rcell[0,0] + y*rcell[0,1] + z*rcell[0,2]
        y_boxed = x*rcell[1,0] + y*rcell[1,1] + z*rcell[1,2]
        z_boxed = x*rcell[2,0] + y*rcell[2,1] + z*rcell[2,2]

        sdx = x_boxed - boxed_center[0]
        sdy = y_boxed - boxed_center[1]
        sdz = z_boxed - boxed_center[2]

        # Apply the PBC to the box-coordinate distance vector between atom j and the center of the shell
        sdx -= round(sdx)
        sdy -= round(sdy)
        sdz -= round(sdz)

        # Convert the box-coordinate distance vector back to a real-coordinate distance vector
        rx = sdx*cell[0,0] + sdy*cell[0,1] + sdz*cell[0,2]
        ry = sdx*cell[1,0] + sdy*cell[1,1] + sdz*cell[1,2]
        rz = sdx*cell[2,0] + sdy*cell[2,1] + sdz*cell[2,2]

        # Compute the squared norm of the distance vector in real coordinates
        r = rx*rx + ry*ry + rz*rz

        # If the distance is below the cutoff, mark molecule i as being in the shell
        if r < r2:
            return 1
    return 0

@cython.cdivision(True)
@cython.boundscheck(False)
@cython.wraparound(False)
def atoms_in_shell(double[:,:] coords,
                   double[:,:] cell,
                   double[:,:] rcell,
                   int[:,:] indexes,
                   cnp.int32_t center,
                   cnp.float64_t radius):
    cdef int i, n_molecules
    cdef double[:] shell_center = coords[center,:]
    cdef double[:] boxed_center = np.empty(3,dtype=np.float)
    cdef int[:] in_shell = np.zeros(indexes.shape[0],dtype=np.int32)

    n_molecules = indexes.shape[0]

    boxed_center[0] = shell_center[0]*rcell[0,0] + shell_center[1]*rcell[0,1] + shell_center[2]*rcell[0,2]
    boxed_center[1] = shell_center[0]*rcell[1,0] + shell_center[1]*rcell[1,1] + shell_center[2]*rcell[1,2]
    boxed_center[2] = shell_center[0]*rcell[2,0] + shell_center[1]*rcell[2,1] + shell_center[2]*rcell[2,2]

    # Loop over the molecules
    for i in prange(n_molecules,nogil=True):
        in_shell[i] = atoms_in_shell_inner(coords, indexes, cell, rcell, boxed_center, radius*radius, i)

    return in_shell.base
setup.py file:
from Cython.Distutils import build_ext
from distutils.core import setup, Extension
import numpy

INCLUDE_DIR = [numpy.get_include()]

EXTENSIONS = [Extension('atoms_in_shell',
                        include_dirs=INCLUDE_DIR,
                        sources=["atoms_in_shell.pyx"],
                        extra_compile_args=["-O3", "-ffast-math", "-march=native", "-fopenmp"],
                        extra_link_args=['-fopenmp']),
             ]

setup(ext_modules=EXTENSIONS, cmdclass={'build_ext': build_ext})
Python code:
from atoms_in_shell import atoms_in_shell
import numpy as np

coords = np.random.uniform(-1000.0, 1000.0, (500000000, 3))
cell = np.identity(3)
rcell = np.identity(3)
indexes = np.empty((10000, 6), dtype=np.int32)
indexes.fill(-1)

for row in indexes:
    n_atoms = np.random.randint(1, 6)
    row[:n_atoms] = np.random.choice(coords.shape[0] - 1, n_atoms)

print(atoms_in_shell(coords, cell, rcell, indexes, 5, 1))
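One quick way to check whether the build actually enabled OpenMP is to time the call under different OMP_NUM_THREADS settings; here is a sketch (assuming the module above compiled as atoms_in_shell, with the sizes shrunk so the arrays fit in memory):

import os
import time

os.environ.setdefault("OMP_NUM_THREADS", "4")  # set before the OpenMP runtime starts

import numpy as np
from atoms_in_shell import atoms_in_shell

coords = np.random.uniform(-1000.0, 1000.0, (1_000_000, 3))
cell = np.identity(3)
rcell = np.identity(3)
indexes = np.empty((100000, 6), dtype=np.int32)
indexes.fill(-1)
for row in indexes:
    n_atoms = np.random.randint(1, 6)
    row[:n_atoms] = np.random.choice(coords.shape[0] - 1, n_atoms)

t0 = time.perf_counter()
atoms_in_shell(coords, cell, rcell, indexes, 5, 1.0)
print(os.environ["OMP_NUM_THREADS"], "threads:", time.perf_counter() - t0, "s")

If the wall time is identical for OMP_NUM_THREADS=1 and =4, the prange loop is effectively serial.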

Created Shared Memory Code with Python Cuda

I'm struggling to get some code running that explores the shared memory features to get a fast matrix multiply. But every time I try this, I seem to run into errors that I cannot fathom.
import numpy as np
from numba import cuda, types

m = 128
n = 32
a = np.arange(m*n).reshape(m,n).astype(np.int32)
b = np.arange(m*n).reshape(n,m).astype(np.int32)
c = np.zeros((m, n)).astype(np.int32)

d_a = cuda.to_device(a)
d_b = cuda.to_device(b)
d_c = cuda.to_device(c)

block_size = (m,n)
grid_size = (int(m/n),int(m/n))

@cuda.jit
def mm(a, b, c):
    column, row = cuda.grid(2)
    sum = 0

    # `a_cache` and `b_cache` are already correctly defined
    a_cache = cuda.shared.array(block_size, types.int32)
    b_cache = cuda.shared.array(block_size, types.int32)

    a_cache[cuda.threadIdx.y, cuda.threadIdx.x] = a[row, column]
    b_cache[cuda.threadIdx.x, cuda.threadIdx.y] = b[column, row]

    cuda.syncthreads()
    for i in range(a.shape[1]):
        sum += a_cache[row][i] * b_cache[i][column]

    c[row][column] = sum
and testing:

mm[grid_size, block_size](d_a, d_b, d_c)
solution = a @ b
output = d_c.copy_to_host()
keeps resulting in the following error:
CudaAPIError: [700] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR
After chatting with the provider of one answer, I've updated the function, but still cannot make this work. For the computation of the sum for each element in the output c, we need to loop over the columns of A and the rows of B, using i as the index. We therefore have n*n products. I think the i is correct in the sum, but I cannot seem to get the correct index for the row and column of a and b in the expression for the sum.
import numpy as np
from numba import cuda, types

@cuda.jit
def mm_shared(a, b, c):
    column, row = cuda.grid(2)
    sum = 0

    # `a_cache` and `b_cache` are already correctly defined
    a_cache = cuda.shared.array(block_size, types.int32)
    b_cache = cuda.shared.array(block_size, types.int32)

    a_cache[cuda.threadIdx.x, cuda.threadIdx.y] = a[row, column]
    b_cache[cuda.threadIdx.x, cuda.threadIdx.y] = b[row, column]

    cuda.syncthreads()
    for i in range(a.shape[1]):
        sum += a_cache[cuda.threadIdx.x, i] * b_cache[i, cuda.threadIdx.y]

    c[row][column] = sum
Your block size is invalid. CUDA devices have a limit of 1024 threads per block. When I run your code I see this:
/opt/miniconda3/lib/python3.7/site-packages/numba/cuda/cudadrv/driver.py in _check_error(self, fname, retcode)
327 _logger.critical(msg, _getpid(), self.pid)
328 raise CudaDriverError("CUDA initialized before forking")
--> 329 raise CudaAPIError(retcode, msg)
330
331 def get_device(self, devnum=0):
CudaAPIError: [1] Call to cuLaunchKernel results in CUDA_ERROR_INVALID_VALUE
When I fix that I see this:
$ cuda-memcheck python somethingsometing.py
========= CUDA-MEMCHECK
========= Invalid __shared__ read of size 4
========= at 0x000008b0 in cudapy::__main__::mm$241(Array<int, int=2, A, mutable, aligned>, Array<int, int=2, A, mutable, aligned>, Array<int, int=2, A, mutable, aligned>)
========= by thread (15,11,0) in block (3,2,0)
========= Address 0x00000ec0 is out of bounds
The why is pretty obvious:
for i in range(a.shape[1]):
    sum += a_cache[row][i] * b_cache[i][column]
row and column are dimensions in the execution grid, not the local shared memory tile, and similarly i is bounded by the shape of a, not the shape of a_cache (note also that you seem to have lapsed into C-style 2D array indexing syntax about half way through the code, which is a potential bug if you don't understand the difference between the two in Python).
To fix it you will have to change the indexing and then implement the rest of the code for multiplication (i.e. you must iteratively load the whole row and column slices through the local shared tiles to compute the full dot product for each row/column pair which a block will process).
Note also that:
The dimensions you have selected for c are wrong (they should be m x m).
The grid size you run the kernel on is also wrong, because the dimensions of c are wrong, so your code could never calculate the whole matrix.
Even after fixing all of this, it is likely that the results of the multiplication will be incorrect at anything other than trivial sizes, because of integer overflow.
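For illustration, a launch configuration that respects that 1024-thread limit and covers an m x m output might look like this sketch (TPB = 32 is an assumption, not a value from the post):

import math

m = 128                           # matrix edge, from the question
TPB = 32                          # 32 * 32 = 1024 threads: the per-block maximum
block_size = (TPB, TPB)
grid_size = (math.ceil(m / TPB),  # enough blocks to tile the full m x m output C
             math.ceil(m / TPB))
print(block_size, grid_size)      # (32, 32) (4, 4)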
@disruptive: Hi, did you find any solution to your problem?
I had the same problem as you, but I solved it by restarting the kernel of the Jupyter notebook.
My code is slightly different from yours:
def mm_shared(a, b, c):
    sum = 0

    # `a_cache` and `b_cache` are already correctly defined
    a_cache = cuda.shared.array(block_size, types.int32)
    b_cache = cuda.shared.array(block_size, types.int32)

    col, row = cuda.grid(2)
    row = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    col = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y

    a_cache[cuda.threadIdx.x, cuda.threadIdx.y] = a[row][col]
    b_cache[cuda.threadIdx.y, cuda.threadIdx.x] = b[col][row]

    for i in range(a.shape[1]):
        a_cache[cuda.threadIdx.x, cuda.threadIdx.y] = a[row, cuda.threadIdx.y + i * N]
        b_cache[cuda.threadIdx.x, cuda.threadIdx.y] = b[cuda.threadIdx.x + i * N, col]
        cuda.syncthreads()
        for j in range(N):
            sum += a_cache[cuda.threadIdx.x, j] * b_cache[j, cuda.threadIdx.y]
        # Wait until all threads finish computing
        cuda.syncthreads()

    c[row][col] = sum
Please let me know if you have any update.
This is the correct solution:
import numpy as np
from numba import cuda, types

@cuda.jit
def mm_shared(a, b, c):
    sum = 0

    # `a_cache` and `b_cache` are already correctly defined
    a_cache = cuda.shared.array(block_size, types.int32)
    b_cache = cuda.shared.array(block_size, types.int32)

    # TODO: use each thread to populate one element each a_cache and b_cache
    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x
    TPB = int(N)

    for i in range(a.shape[1] // TPB):  # integer division; assumes the width is a multiple of TPB
        a_cache[tx, ty] = a[x, ty + i * TPB]
        b_cache[tx, ty] = b[tx + i * TPB, y]

        cuda.syncthreads()
        for j in range(TPB):
            # TODO: calculate the `sum` value correctly using values from the cache
            sum += a_cache[tx][j] * b_cache[j][ty]
        cuda.syncthreads()

    c[x][y] = sum
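The kernel above relies on module-level globals (N and block_size); a driver consistent with them might look like this sketch (the sizes are assumptions: square matrices whose edge is a multiple of N, with the globals defined before the first kernel call so Numba picks them up at compile time):

import numpy as np
from numba import cuda

N = 32                                # tile edge; matches the kernel's TPB
block_size = (N, N)                   # 32 * 32 = 1024 threads per block
m = 128                               # matrix edge, assumed to be a multiple of N

a = np.arange(m * m, dtype=np.int32).reshape(m, m)
b = np.arange(m * m, dtype=np.int32).reshape(m, m)
c = np.zeros((m, m), dtype=np.int32)  # m x m output, per the first answer

d_a, d_b, d_c = cuda.to_device(a), cuda.to_device(b), cuda.to_device(c)
mm_shared[(m // N, m // N), block_size](d_a, d_b, d_c)
result = d_c.copy_to_host()           # beware the int32 overflow noted above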

How to avoid append when computing the intersection of two lists?

I am writing a function to compute the intersection between two sorted arrays (which may contain duplicates). So if the input is [0,3,7,7,7,9,12] and [2,7,7,8,12], the output should be [7,7,12], for example.
Here is my code:
cimport cython

@cython.wraparound(False)
@cython.cdivision(True)
@cython.boundscheck(False)
def sorting(int[:] A, int[:] B):
    cdef Py_ssize_t i = 0
    cdef Py_ssize_t j = 0
    cdef int lenA = A.shape[0]
    cdef int lenB = B.shape[0]
    intersect = []
    while (i < lenA and j < lenB):
        if A[i] == B[j]:
            intersect.append(A[i])
            i += 1
            j += 1
        elif A[i] > B[j]:
            j += 1
        elif A[i] < B[j]:
            i += 1
    return intersect
As you will see, I use a list to store the answers and append to add the answers as they arrive. I am happy to return a python or numpy array if that will speed things up.
How can I avoid append to speed up the cython?
For this kind of thing you usually want to pre-allocate the array (it's basically free to shrink it later). In this case it can't be longer than the shortest of your input arrays, so that gives you a starting size:
cdef int[::1] intersect = np.zeros(A.shape[0] if A.shape[0] < B.shape[0] else B.shape[0], dtype=np.intc)
You then just keep a running count of the index you're at in that array (say k), so append is replaced by:
intersect[k] = A[i]
k += 1
At the end you can either return the memoryview intersect[:k] or convert it to a numpy array with np.asarray(intersect[:k]).
As an aside: I'd remove the Cython directive @cython.cdivision(True), since you aren't doing any division. You should think about whether these directives are useful and whether they apply to your code, rather than blindly copying them in out of habit.
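Putting those pieces together, the whole pattern looks like this in plain Python/NumPy (a sketch; the Cython version just adds the cdef typings from the question):

import numpy as np

def sorting(A, B):
    # Preallocate to the largest possible size, then shrink at the end.
    intersect = np.zeros(min(A.shape[0], B.shape[0]), dtype=np.intc)
    i = j = k = 0
    while i < A.shape[0] and j < B.shape[0]:
        if A[i] == B[j]:
            intersect[k] = A[i]   # replaces list.append
            k += 1
            i += 1
            j += 1
        elif A[i] > B[j]:
            j += 1
        else:
            i += 1
    return intersect[:k]          # shrinking is basically free

print(sorting(np.array([0, 3, 7, 7, 7, 9, 12], dtype=np.intc),
              np.array([2, 7, 7, 8, 12], dtype=np.intc)))   # [ 7  7 12]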

Is it a data race in a nested thrust functor?

I have tested this snippet and tried to explain its cause, as well as a way to resolve it, but have failed to do so:
#include <thrust/inner_product.h>
#include <thrust/functional.h>
#include <thrust/device_vector.h>
#include <thrust/random.h>
#include <thrust/execution_policy.h>
#include <iostream>
#include <cmath>
#include <boost/concept_check.hpp>

struct alter_tuple {
    alter_tuple(const int& a_, const int& b_) : a(a_), b(b_) {};

    __host__ __device__
    thrust::tuple<int,int> operator()(thrust::tuple<int,int> X)
    {
        int Xx  = thrust::get<0>(X);
        int Xy  = thrust::get<1>(X);
        int Xpx = a*Xx - b*Xy;
        int Xpy = -b*Xx + a*Xy;
        printf("in (%d,%d) -> (%d,%d)\n", Xx, Xy, Xpx, Xpy);
        return thrust::make_tuple(Xpx, Xpy);
    }

    int a; // these variables a,b are shared between different threads used by this functor kernel
    int b; // which easily creates a racing problem
};

struct alter_tuple_arr {
    alter_tuple_arr(int* a_, int* b_, int* c_, int* d_) : a(a_), b(b_), c(c_), d(d_) {};

    __host__ __device__
    thrust::tuple<int,int> operator()(const int& idx)
    {
        int Xx  = a[idx];
        int Xy  = b[idx];
        int Xpx = a[idx]*Xx - b[idx]*Xy;
        int Xpy = -b[idx]*Xx + a[idx]*Xy;
        printf("in (%d,%d) -> (%d,%d)\n", Xx, Xy, Xpx, Xpy);
        return thrust::make_tuple(Xpx, Xpy);
    }

    int* a;
    int* b;
    int* c;
    int* d;
};

struct bFuntor
{
    bFuntor(int* av__, int* bv__, int* cv__, int* dv__, const int& N__) : av_(av__), bv_(bv__), cv_(cv__), dv_(dv__), N_(N__) {};

    __host__ __device__
    int operator()(const int& idx)
    {
        thrust::device_ptr<int> av_dpt  = thrust::device_pointer_cast(av_);
        thrust::device_ptr<int> av_dpt1 = thrust::device_pointer_cast(av_+N_);
        thrust::device_ptr<int> bv_dpt  = thrust::device_pointer_cast(bv_);
        thrust::device_ptr<int> bv_dpt1 = thrust::device_pointer_cast(bv_+N_);
        thrust::device_ptr<int> cv_dpt  = thrust::device_pointer_cast(cv_);
        thrust::device_ptr<int> cv_dpt1 = thrust::device_pointer_cast(cv_+N_);
        thrust::device_ptr<int> dv_dpt  = thrust::device_pointer_cast(dv_);
        thrust::device_ptr<int> dv_dpt1 = thrust::device_pointer_cast(dv_+N_);

        thrust::detail::normal_iterator<thrust::device_ptr<int>> a0 = thrust::detail::make_normal_iterator<thrust::device_ptr<int>>(av_dpt);
        thrust::detail::normal_iterator<thrust::device_ptr<int>> a1 = thrust::detail::make_normal_iterator<thrust::device_ptr<int>>(av_dpt1);
        thrust::detail::normal_iterator<thrust::device_ptr<int>> b0 = thrust::detail::make_normal_iterator<thrust::device_ptr<int>>(bv_dpt);
        thrust::detail::normal_iterator<thrust::device_ptr<int>> b1 = thrust::detail::make_normal_iterator<thrust::device_ptr<int>>(bv_dpt1);
        thrust::detail::normal_iterator<thrust::device_ptr<int>> c0 = thrust::detail::make_normal_iterator<thrust::device_ptr<int>>(cv_dpt);
        thrust::detail::normal_iterator<thrust::device_ptr<int>> c1 = thrust::detail::make_normal_iterator<thrust::device_ptr<int>>(cv_dpt1);
        thrust::detail::normal_iterator<thrust::device_ptr<int>> d0 = thrust::detail::make_normal_iterator<thrust::device_ptr<int>>(dv_dpt);
        thrust::detail::normal_iterator<thrust::device_ptr<int>> d1 = thrust::detail::make_normal_iterator<thrust::device_ptr<int>>(dv_dpt1);

        // ** alter_tuple is WRONG
#define WRONG
#ifdef WRONG
        thrust::transform(thrust::device,
                          thrust::make_zip_iterator(thrust::make_tuple(a0,b0)),
                          thrust::make_zip_iterator(thrust::make_tuple(a1,b1)),
                          // thrust::make_zip_iterator(thrust::make_tuple(cv_dpt,dv_dpt)), // cv_dpt
                          thrust::make_zip_iterator(thrust::make_tuple(c0,d0)), // cv_dpt
                          alter_tuple(cv_[idx],dv_[idx]));
#endif
#ifdef RIGHT
        // ** alter_tuple_arr is the CORRECT way to do it
        thrust::transform(thrust::device,
                          thrust::counting_iterator<int>(0),
                          thrust::counting_iterator<int>(N_),
                          // thrust::make_zip_iterator(thrust::make_tuple(cv_dpt,dv_dpt)), // cv_dpt
                          thrust::make_zip_iterator(thrust::make_tuple(c0,d0)), // cv_dpt
                          alter_tuple_arr(av_,bv_,cv_,dv_));
#endif
        for (int i=0; i<N_; i++)
            printf("out: (%d,%d) -> (%d,%d)\n", av_[i], bv_[i], cv_[i], dv_[i]);
        return cv_dpt[idx];
    }

    int* av_;
    int* bv_;
    int* cv_;
    int* dv_;
    int N_;
    float af; // are these variables host side or device side??
};

__host__ __device__
unsigned int hash(unsigned int a)
{
    a = (a+0x7ed55d16) + (a<<12);
    a = (a^0xc761c23c) ^ (a>>19);
    a = (a+0x165667b1) + (a<<5);
    a = (a+0xd3a2646c) ^ (a<<9);
    a = (a+0xfd7046c5) + (a<<3);
    a = (a^0xb55a4f09) ^ (a>>16);
    return a;
}

int main(void)
{
    int N = 10;
    std::vector<int> av,bv,cv,dv;
    unsigned int seed = hash(10);
    thrust::default_random_engine rng(seed);
    thrust::uniform_real_distribution<float> u01(0,10);
    for (int i=0; i<N; i++) {
        av.push_back((int)u01(rng));
        bv.push_back((int)u01(rng));
        cv.push_back((int)u01(rng));
        dv.push_back((int)u01(rng));
        // printf("%d %d %d %d \n",av[i],bv[i],cv[i],dv[i]);
    }

    thrust::device_vector<int> av_d(N);
    thrust::device_vector<int> bv_d(N);
    thrust::device_vector<int> cv_d(N);
    thrust::device_vector<int> dv_d(N);
    av_d = av; bv_d = bv; cv_d = cv; dv_d = dv;

    thrust::transform(thrust::counting_iterator<int>(0),
                      thrust::counting_iterator<int>(N),
                      cv_d.begin(),
                      bFuntor(thrust::raw_pointer_cast(av_d.data()),
                              thrust::raw_pointer_cast(bv_d.data()),
                              thrust::raw_pointer_cast(cv_d.data()),
                              thrust::raw_pointer_cast(dv_d.data()),
                              N));

    thrust::host_vector<int> bv_h(N);
    thrust::copy(bv_d.begin(), bv_d.end(), bv_h.begin()); // probably I forgot this: to copy back the result from device to host!
    return 0;
}
In these nested thrust calls, two nested functors were tested; one of them worked (the one with #define RIGHT). In the case of the WRONG functor, i.e. alter_tuple:
Where do the two variables int a, int b reside? Host or device? Local kernel registers? Or are they shared between the threads of this functor's operator?
Inside the alter_tuple functor, I tried to print out the result (the printf("in...")), and the calculation there is correct. However, when this result is returned to the caller functor and printed out (the printf("out...")), the values are incorrect and differ from the previous calculation.
How can these results be different? I can't explain it, and there is no documentation or example to refer to.
The difference is shown in the output here (not reproduced).
Edit 1:
A minimum-size test shows that the functors (literally, a*x = y) in both cases receive/initialize their values correctly: SO_example_no_tuple_arr_wo_c.cu
The printout is:
out: 9*8 -> 72
out: 9*8 -> 72
out: 9*8 -> 72
out: 6*4 -> 24
out: 6*4 -> 24
out: 6*4 -> 24
out: 1*8 -> 8
out: 1*8 -> 8
out: 1*6 -> 6
out: 9*1 -> 9
out: 9*1 -> 9
which shows the correct received values.
A minimum test that does not use a pointer/array to pass the input values shows that, even though the inputs are correctly initialized, the returned results are wrong: SO_example_no_tuple.cu
Its output in the case N=2:
in 9*8 -> 72
in 6*4 -> 24
in 9*8 -> 72
in 6*4 -> 24
out: 9*8 -> 24
out: 9*8 -> 24
out: 6*4 -> 24
out: 6*4 -> 24
The difference in values is not strictly due to a data race problem.
Your two approaches do not do the same thing, and it has to do with the values of a and b that will be selected for each invocation of the nested thrust::transform call. This is evident if you set N = 1, which should remove any concerns about data racing. The results are still different.
In the "failing" case, you are invoking the alter_tuple() operator like so:
thrust::transform(thrust::device,
                  ...
                  alter_tuple(cv_[idx],dv_[idx]));
These values (cv_[idx], dv_[idx]) then become your initializing parameters ending up in a and b variables inside the functor. But your "passing" case is effectively initializing these variables differently, using a[idx] and b[idx], which correspond to av_[idx] and bv_[idx]. If we change the alter_tuple invocation to use a and b:
alter_tuple(av_[idx],bv_[idx]));
then the N = 1 case results now match. This was easier to understand because we had in fact only one entry in the a, b, c, d vectors.
When we expand to the N = 10 case, however, we no longer get matching results. To explain why, we need to understand the use of a and b inside the functor in this case. In the "failing" case, we are passing a single initializing value for each of a and b as used in the functor:
alter_tuple(av_[idx],bv_[idx]));
so, for a given thread, which means for a given invocation of the nested thrust::transform call, a single value will be used for a and b:
alter_tuple(const int& a_, const int& b_) : a(a_), b(b_){};
...
int a; // these values are constant across variation of "idx"
int b; // passed to the functor
on the other hand, in the "passing" case, the a and b values will vary for each element passed to the functor, within the nested transform call:
thrust::tuple<int,int> operator()(const int& idx)
{
    int Xx = a[idx]; // these values of a and b *vary* for each idx
    int Xy = b[idx]; // passed to the functor
Once that is understood, if the "passing" case is the desired case, then I have no idea how to transform the first case to produce passing results, as there is no way you can cause a single initializing value to take on the behavior of the varying values for a and b in the "passing" case.
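To make that difference concrete, here is the same arithmetic in a few lines of Python (the values are illustrative, and idx plays the role of the outer thread index):

av, bv = [9, 6], [8, 4]
idx = 0

# "failing" case: one (a, b) pair, fixed at idx, applied to every element
a, b = av[idx], bv[idx]
failing = [(a * x - b * y, -b * x + a * y) for x, y in zip(av, bv)]

# "passing" case: the coefficients vary with the element index i
passing = [(av[i] * av[i] - bv[i] * bv[i], -bv[i] * av[i] + av[i] * bv[i])
           for i in range(len(av))]

print(failing)  # [(17, 0), (22, -12)]
print(passing)  # [(17, 0), (20, 0)]  -- agrees with "failing" only at idx itself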
None of the above involves data racing, but since each thread is writing to every value of c and d, I don't think this overall approach makes any sense, and I'm not sure what you are trying to accomplish. I think if you expanded this to more elements/threads, you could certainly experience unpredictable/variable results.
To answer some of your other questions, the variables a and b end up as thread-local variables, on the device. So each data member in either functor is a thread-local variable on the device.
Inside, the alter_tuple functor, I tried to print out the result (int printf("in...")) and this is correct calculation. However, when this result is returned to caller functor and is printed out (in printf("out....")), they are incorrect and are different with its previous calculation
Each thread is writing to the same locations in the c and d vector. Therefore, since each thread writes to the entire vector, but (in the failing case) each thread uses a different initializing value for a and b inside the functor, it stands to reason that each thread will compute a different result for the values of c and d, and the results you get after completion of the thrust call will depend on which thread "wins" the output write operation. This is unpredictable, and certainly not all threads printout will match the final result, because each thread will compute different values for c and d.

How can I reverse the ON bits in a byte?

I was reading Joel's book, where he suggested this as an interview question:
Write a program to reverse the "ON" bits in a given byte.
I can only think of a solution using C.
Asking here so you can show me how to do it in a non-C way (if possible).
I claim trick question. :) Reversing all bits means a flip-flop, but only the bits that are on clearly means:
return 0;
What specifically does that question mean?
Good question. If reversing the "ON" bits means reversing only the bits that are "ON", then you will always get 0, no matter what the input is. If it means reversing all the bits, i.e. changing all 1s to 0s and all 0s to 1s, which is how I initially read it, then that's just a bitwise NOT, or complement. C-based languages have a complement operator, ~, that does this. For example:
unsigned char b = 102; /* 0x66, 01100110 */
unsigned char reverse = ~b; /* 0x99, 10011001 */
What specifically does that question mean?
Does reverse mean setting 1's to 0's and vice versa?
Or does it mean 00001100 --> 00110000, where you reverse their order in the byte? Or perhaps just reversing the part from the first 1 to the last 1? i.e., 00110101 --> 00101011?
Assuming it means reversing the bit order in the whole byte, here's an x86 assembler version:
; al is input register
; bl is output register
xor bl, bl ; clear output
; first bit
rcl al, 1 ; rotate al through carry
rcr bl, 1 ; rotate carry into bl
; duplicate above 2-line statements 7 more times for the other bits
Not the most optimal solution; a table lookup is faster.
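The table lookup mentioned there, sketched in Python (build the 256-entry table once; each reversal is then a single index):

REV = [int(format(i, "08b")[::-1], 2) for i in range(256)]

def reverse_byte(b):
    return REV[b]   # one lookup per byte

assert reverse_byte(0b00001100) == 0b00110000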
Reversing the order of bits in C#:
byte ReverseByte(byte b)
{
    byte r = 0;
    for (int i = 0; i < 8; i++)
    {
        int mask = 1 << i;
        int bit = (b & mask) >> i;
        int reversedMask = bit << (7 - i);
        r |= (byte)reversedMask;
    }
    return r;
}
I'm sure there are more clever ways of doing it, but in this precise case the interview question is meant to determine whether you know bitwise operations, so I guess this solution would work.
In an interview, the interviewer usually wants to know how you find a solution, what your problem-solving skills are, and whether your work is clean or a hack. So don't come up with too clever a solution, because that will probably suggest you found it somewhere on the Internet beforehand. Don't try to fake that you don't know it either, pretending you just came up with the answer because you are a genius; that will be even worse if the interviewer figures it out, since you are basically lying.
If you're talking about switching 1's to 0's and 0's to 1's, using Ruby:
n = 0b11001100
~n
If you mean reverse the order:
n = 0b11001100
eval("0b" + n.to_s(2).reverse)
If you mean counting the on bits, as mentioned by another user:
n = 123
count = 0
0.upto(8) { |i| count = count + n[i] }
♥ Ruby
I'm probably misremembering, but I thought that Joel's question was about counting the "on" bits rather than reversing them.
Here you go:
#include <stdio.h>

int countBits(unsigned char byte);

int main(){
    FILE* out = fopen("bitcount.c", "w");
    int i;
    fprintf(out, "#include <stdio.h>\n#include <stdlib.h>\n#include <time.h>\n\n");
    fprintf(out, "int bitcount[256] = {");
    for(i=0; i<256; i++){
        fprintf(out, "%i", countBits((unsigned char)i));
        if( i < 255 ) fprintf(out, ", ");
    }
    fprintf(out, "};\n\n");
    fprintf(out, "int main(){\n");
    fprintf(out, "srand ( time(NULL) );\n");
    fprintf(out, "\tint num = rand() %% 256;\n");
    fprintf(out, "\tprintf(\"The byte %%i has %%i bits set to ON.\\n\", num, bitcount[num]);\n");
    fprintf(out, "\treturn 0;\n");
    fprintf(out, "}\n");
    fclose(out);
    return 0;
}

int countBits(unsigned char byte){
    unsigned char mask = 1;
    int count = 0;
    while(mask){
        if( mask & byte ) count++;
        mask <<= 1;
    }
    return count;
}
The classic Bit Hacks page has several (really very clever) ways to do this, but it's all in C. Any language derived from C syntax (notably Java) will likely have similar methods. I'm sure we'll get some Haskell versions in this thread ;)
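One of those hacks, transcribed to Python as a sketch (it reverses the eight bits of b with a 64-bit multiply, a mask, and a modulus; valid for 0 <= b <= 255):

def reverse_byte(b):
    return (b * 0x0202020202 & 0x010884422010) % 1023

assert reverse_byte(0b00001100) == 0b00110000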
byte ReverseByte(byte b)
{
    return b ^ 0xff;
}
That works if ^ is XOR in your language, but not if it's exponentiation, which it is in several languages.
And here's a version directly cut and pasted from OpenJDK, which is interesting because it involves no loop. On the other hand, unlike the Scheme version I posted, this version only works for 32-bit and 64-bit numbers. :-)
32-bit version:
public static int reverse(int i) {
    // HD, Figure 7-1
    i = (i & 0x55555555) << 1 | (i >>> 1) & 0x55555555;
    i = (i & 0x33333333) << 2 | (i >>> 2) & 0x33333333;
    i = (i & 0x0f0f0f0f) << 4 | (i >>> 4) & 0x0f0f0f0f;
    i = (i << 24) | ((i & 0xff00) << 8) |
        ((i >>> 8) & 0xff00) | (i >>> 24);
    return i;
}
64-bit version:
public static long reverse(long i) {
    // HD, Figure 7-1
    i = (i & 0x5555555555555555L) << 1 | (i >>> 1) & 0x5555555555555555L;
    i = (i & 0x3333333333333333L) << 2 | (i >>> 2) & 0x3333333333333333L;
    i = (i & 0x0f0f0f0f0f0f0f0fL) << 4 | (i >>> 4) & 0x0f0f0f0f0f0f0f0fL;
    i = (i & 0x00ff00ff00ff00ffL) << 8 | (i >>> 8) & 0x00ff00ff00ff00ffL;
    i = (i << 48) | ((i & 0xffff0000L) << 16) |
        ((i >>> 16) & 0xffff0000L) | (i >>> 48);
    return i;
}
Pseudocode:

while (Read())
    Write(0);
I'm probably misremembering, but I thought that Joel's question was about counting the "on" bits rather than reversing them.
Here's the obligatory Haskell soln for complementing the bits, it uses the library function, complement:
import Data.Bits
import Data.Int
i = 123::Int
i32 = 123::Int32
i64 = 123::Int64
var2 = 123::Integer
test1 = sho i
test2 = sho i32
test3 = sho i64
test4 = sho var2 -- Exception
sho i = putStrLn $ showBits i ++ "\n" ++ (showBits $complement i)
showBits v = concatMap f (showBits2 v) where
f False = "0"
f True = "1"
showBits2 v = map (testBit v) [0..(bitSize v - 1)]
If the question means to flip all the bits, and you aren't allowed to use C-like operators such as XOR and NOT, then this will work (in two's complement, ~b equals -1 - b):
bFlipped = -1 - bInput;
I'd modify palmsey's second example, eliminating a bug and eliminating the eval:
n = 0b11001100
n.to_s(2).rjust(8, '0').reverse.to_i(2)
The rjust is important if the number to be bitwise-reversed is a fixed-length bit field -- without it, the reverse of 0b00101010 would be 0b10101 rather than the correct 0b01010100. (Obviously, the 8 should be replaced with the length in question.) I just got tripped up by this one.
Asking here so you can show me how to do it in a non-C way (if possible)
Say you have the number 10101010. To change 1s to 0s (and vice versa) you just use XOR:
10101010
^11111111
--------
01010101
Doing it by hand is about as "Non C" as you'll get.
However from the wording of the question it really sounds like it's only turning off "ON" bits... In which case the answer is zero (as has already been mentioned) (unless of course the question is actually asking to swap the order of the bits).
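The same flip in a line of Python, for concreteness:

n = 0b10101010
print(format(n ^ 0b11111111, "08b"))   # 01010101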
Since the question asked for a non-C way, here's a Scheme implementation, cheerfully plagiarised from SLIB:
(define (bit-reverse k n)
  (do ((m (if (negative? n) (lognot n) n) (arithmetic-shift m -1))
       (k (+ -1 k) (+ -1 k))
       (rvs 0 (logior (arithmetic-shift rvs 1) (logand 1 m))))
      ((negative? k) (if (negative? n) (lognot rvs) rvs))))

(define (reverse-bit-field n start end)
  (define width (- end start))
  (let ((mask (lognot (ash -1 width))))
    (define zn (logand mask (arithmetic-shift n (- start))))
    (logior (arithmetic-shift (bit-reverse width zn) start)
            (logand (lognot (ash mask start)) n))))
Rewritten as C (for people unfamiliar with Scheme), it'd look something like this (with the understanding that in Scheme, numbers can be arbitrarily big):
int
bit_reverse(int k, int n)
{
    int m = n < 0 ? ~n : n;
    int rvs = 0;
    while (--k >= 0) {
        rvs = (rvs << 1) | (m & 1);
        m >>= 1;
    }
    return n < 0 ? ~rvs : rvs;
}

int
reverse_bit_field(int n, int start, int end)
{
    int width = end - start;
    int mask = ~(-1 << width);
    int zn = mask & (n >> start);
    return (bit_reverse(width, zn) << start) | (~(mask << start) & n);
}
Reversing the bits.
For example, say we have the number represented by 01101011. If we reverse the bits, it becomes 11010110. To achieve this, you first need to know how to swap two bits in a number.
Swapping two bits in a number: check whether the two bits differ (XOR them and test the result). If they are the same, nothing needs to change; if they differ, flip both of them by XORing the number with a mask that has just those two bit positions set, and store the result back.
Now, to reverse the number:
for I less than NumberOfBits/2:
    swap(Number, I, NumberOfBits-1-I)
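A Python sketch of that swap-based reversal (the helper name is mine):

def reverse_bits(n, num_bits=8):
    for i in range(num_bits // 2):
        j = num_bits - 1 - i
        # if bits i and j differ, toggling both of them swaps them
        if ((n >> i) & 1) != ((n >> j) & 1):
            n ^= (1 << i) | (1 << j)
    return n

assert reverse_bits(0b01101011) == 0b11010110   # the example from above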