What might cause "Undefined Behaviour" in this parallel GPU code?

Let's assume core1 and core2 try to write their variables a and b to the same memory location.
How can UB be explained here?
1. We don't know whether a or b is written to that memory location (as the last action).
2. We don't even know what is written there (garbage).
3. Even the target memory address can be miscalculated (segfault?).
4. Some logic gates produce wrong currents and the CPU disables itself.
5. The CPU's frequency information becomes corrupt and it goes into a high overclock (and breaks itself).
Can I assume that only the first option is valid for all vendors of CPUs (and GPUs)?
I just converted the code below into parallel GPU code and it seems to be working fine.
Generic code:
for (j=0; j<YRES/CELL; j++) // this is parallelized
for (i=0; i<XRES/CELL; i++) // this is parallelized
{
r = fire_r[j][i];
g = fire_g[j][i];
b = fire_b[j][i];
if (r || g || b)
for (y=-CELL; y<2*CELL; y++)
for (x=-CELL; x<2*CELL; x++)
addpixel(i*CELL+x, j*CELL+y, r, g, b, fire_alpha[y+CELL][x+CELL]);
//addpixel accesses neighbour cells' informations and writes on them
//and makes UB
r *= 8;
g *= 8;
b *= 8;
for (y=-1; y<2; y++)
for (x=-1; x<2; x++)
if ((x || y) && i+x>=0 && j+y>=0 && i+x<XRES/CELL && j+y<YRES/CELL)
{
r += fire_r[j+y][i+x];
g += fire_g[j+y][i+x];
b += fire_b[j+y][i+x];
}
r /= 16;
g /= 16;
b /= 16;
fire_r[j][i] = r>4 ? r-4 : 0; // UB
fire_g[j][i] = g>4 ? g-4 : 0; // UB
fire_b[j][i] = b>4 ? b-4 : 0;
}
OpenCL:
" int i=get_global_id(0); int j=get_global_id(1);"
" int VIDXRES="+std::to_string(kkVIDXRES)+";"
" int VIDYRES="+std::to_string(kkVIDYRES)+";"
" int XRES="+std::to_string(kkXRES)+";"
" int CELL="+std::to_string(kkCELL)+";"
" int YRES="+std::to_string(kkYRES)+";"
" int x=0,y=0,r=0,g=0,b=0,nx=0,ny=0;"
" r = fire_r[j*(XRES/CELL)+i];"
" g = fire_g[j*(XRES/CELL)+i];"
" b = fire_b[j*(XRES/CELL)+i];"
" int counterx=0;"
" if (r || g || b)"
" for (y=-CELL; y<2*CELL; y++){"
" for (x=-CELL; x<2*CELL; x++){"
" addpixel(i*CELL+x, j*CELL+y, r, g, b, fire_alpha[(y+CELL)*(3*CELL)+(x+CELL)],vid,vido);"
" }}"
" r *= 8;"
" g *= 8;"
" b *= 8;"
" for (y=-1; y<2; y++){"
" for (x=-1; x<2; x++){"
" if ((x || y) && i+x>=0 && j+y>=0 && i+x<XRES/CELL && j+y<YRES/CELL)"
" {"
" r += fire_r[(j+y)*(XRES/CELL)+(i+x)];"
" g += fire_g[(j+y)*(XRES/CELL)+(i+x)];"
" b += fire_b[(j+y)*(XRES/CELL)+(i+x)];"
" }}}"
" r /= 16;"
" g /= 16;"
" b /= 16;"
" fire_r[j*(XRES/CELL)+i] = (r>4 ? r-4 : 0);"
" fire_g[j*(XRES/CELL)+i] = (g>4 ? g-4 : 0);"
" fire_b[j*(XRES/CELL)+i] = (b>4 ? b-4 : 0);"
Here is a picture of some rare artifacts caused by UB at the local-range boundaries of a 2D NDRange kernel. Can these kill my GPU?

On x86 and x86_64 architectures it means "we don't know whether a or b is written to that memory location (as the last action)", because load/store operations on memory-aligned 32-bit (both) or 64-bit (x86_64 only) data types are atomic.
On other architectures, "we don't even know what is written there (garbage)" is usually a valid answer. That is certainly so on RISC architectures; I currently don't know about GPUs.
Note that the fact that the code works doesn't imply that it is correct. In 99% of cases it's the source of sentences like "there's a compiler bug, the code was working until the previous version" or "the code works on the development machine. The server selected for production is broken" :)
EDIT:
On NVIDIA GPUs we have a weakly-ordered memory model. Its description in the CUDA C Programming Guide does not explicitly state that store operations are atomic. The write operations described there come from the same thread, so this does not imply that load/store operations are atomic.

For the code above, IMHO the first option is the only possible one. Basically, if you assume that you have enough threads/processors to execute all the loops in parallel, the inner nested loops (the x and y ones) will have undetermined values.
For example, if we consider only the
r += fire_r[j+y][i+x];
section, the value at fire_r[j+y][i+x] can be the original one just as well as the result of another instance of the same loop being finished in another thread.
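One common way to take this read/write race out of the neighbour-averaging step is double buffering: each cell reads only from an input grid and writes only to a separate output grid, so the result no longer depends on the order in which cells are processed. This is a swapped-in technique, not something the question's code does. The sketch below is plain C for a single channel; `blur_step`, `W` and `H` are hypothetical names, and the `addpixel` part of the loop is deliberately left out.

```c
/* Hypothetical grid size, standing in for XRES/CELL and YRES/CELL. */
enum { W = 4, H = 3 };

/* One averaging/cooling step for a single colour channel.
   Reads only from `in` and writes only to `out`, so every (i, j) iteration
   is independent of the others and could run as its own work-item. */
void blur_step(const int *in, int *out)
{
    for (int j = 0; j < H; j++) {
        for (int i = 0; i < W; i++) {
            int r = in[j * W + i] * 8;          /* centre cell weighted by 8 */
            for (int y = -1; y < 2; y++)
                for (int x = -1; x < 2; x++)
                    if ((x || y) && i + x >= 0 && j + y >= 0 &&
                        i + x < W && j + y < H)
                        r += in[(j + y) * W + (i + x)]; /* old values only */
            r /= 16;
            out[j * W + i] = r > 4 ? r - 4 : 0;
        }
    }
}
```

After each step the caller would swap the roles of the two buffers. The cost is one extra copy of each fire_r/fire_g/fire_b array.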

Related

Bit tricks to find the first position where the number of 0s equals the number of 1s

Suppose I have a 32 or 64 bit unsigned integer.
What is the fastest way to find the index i of the leftmost bit such that the number of 0s in the leftmost i bits equals the number of 1s in the leftmost i bits?
I was thinking of some bit tricks like the ones mentioned here.
I am interested in recent x86_64 processors. This might be relevant, as some processors support instructions such as POPCNT (counts the number of 1s) or LZCNT (counts the number of leading 0s).
If it helps, it is possible to assume that the first bit has always a certain value.
Example (with 16 bits):
If the integer is
1110010100110110b
         ^
         i
then i=10 and it corresponds to the marked position.
A possible (slow) implementation for 16-bit integers could be:
mask = 1000000000000000b
pos = 0
count=0
do {
if(x & mask)
count++;
else
count--;
pos++;
x<<=1;
} while(count)
return pos;
Edit: fixed a bug in the code as per @njuffa's comment.

I don't have any bit tricks for this, but I do have a SIMD trick.
First a few observations,
Interpreting 0 as -1, this problem becomes "find the first i so that the first i bits sum to 0".
0 is even but all the bits have odd values under this interpretation, which gives the insight that i must be even and this problem can be analyzed by blocks of 2 bits.
01 and 10 don't change the balance.
After spreading the groups of 2 out to bytes (none of the following is tested),
// optionally use AVX2 _mm_srlv_epi32 instead of ugly variable set
__m128i spread = _mm_shuffle_epi8(_mm_setr_epi32(x, x >> 2, x >> 4, x >> 6),
_mm_setr_epi8(0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15));
spread = _mm_and_si128(spread, _mm_set1_epi8(3));
Replace 00 by -1, 11 by 1, and 01 and 10 by 0:
__m128i r = _mm_shuffle_epi8(_mm_setr_epi8(-1, 0, 0, 1, 0,0,0,0,0,0,0,0,0,0,0,0),
spread);
Calculate the prefix sum:
__m128i pfs = _mm_add_epi8(r, _mm_bsrli_si128(r, 1));
pfs = _mm_add_epi8(pfs, _mm_bsrli_si128(pfs, 2));
pfs = _mm_add_epi8(pfs, _mm_bsrli_si128(pfs, 4));
pfs = _mm_add_epi8(pfs, _mm_bsrli_si128(pfs, 8));
Find the highest 0:
__m128i iszero = _mm_cmpeq_epi8(pfs, _mm_setzero_si128());
return __builtin_clz(_mm_movemask_epi8(iszero) << 15) * 2;
The << 15 and * 2 appear because the resulting mask is 16 bits wide but the clz is 32-bit. The mask is shifted by one less than 16 because a zero in the top byte indicates that 1 group of 2 is taken, not zero groups.
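Since the SIMD code above is marked as untested, a scalar version of the same observation (walk the word in groups of 2 bits, where 11 raises the balance, 00 lowers it, and 01/10 leave it unchanged) may be useful as a cross-check. This is only a sketch; the name `first_balanced_prefix` is hypothetical.

```c
#include <stdint.h>

/* Returns the length of the shortest non-empty leftmost prefix of x that
   contains as many 0-bits as 1-bits, or 0 if no such prefix exists.
   The prefix length is always even, so the word is scanned in bit pairs. */
int first_balanced_prefix(uint32_t x)
{
    int balance = 0;    /* (number of 11 pairs) - (number of 00 pairs) so far */
    for (int pair = 0; pair < 16; pair++) {
        unsigned duo = (x >> (30 - 2 * pair)) & 3u;  /* pairs from the left */
        if (duo == 3u)
            balance++;
        else if (duo == 0u)
            balance--;
        if (balance == 0)
            return 2 * (pair + 1);
    }
    return 0;
}
```

For the 16-bit example from the question, placed in the upper half of a 32-bit word (0xE5360000), this returns 10.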

This is a solution for 32-bit data using classical bit-twiddling techniques. The intermediate computation requires 64-bit arithmetic and logic operations. I have tried to stick to portable operations as far as possible. Required is an implementation of the POSIX function ffsll to find the least-significant 1-bit in a 64-bit long long, and a custom function rev_bit_duos that reverses the bit-duos in a 32-bit integer. The latter could be replaced with a platform-specific bit-reversal intrinsic, such as the __rbit intrinsic on ARM platforms.
The basic observation is that if a bit-group with an equal number of 0-bits and 1-bits can be extracted, it must contain an even number of bits. This means we can examine the operand in 2-bit groups. We can further restrict ourselves to tracking whether each 2-bit group increases (0b11), decreases (0b00) or leaves unchanged (0b01, 0b10) a running balance of bits. If we count positive and negative changes with separate counters, 4-bit counters will suffice unless the input is 0 or 0xffffffff, which can be handled separately. Based on comments to the question, these cases shouldn't occur. By subtracting the negative change count from the positive change count for each 2-bit group we can find at which group the balance becomes zero. There may be multiple such bit groups; we need to find the first one.
The processing can be parallelized by expanding each 2-bit group into a nibble that then can serve as a change counter. The prefix sum can be computed via integer multiply with an appropriate constant, which provides the necessary shift & add operations at each nibble position. Efficient ways for parallel nibble-wise subtraction are well-known, likewise there is a well-known technique due to Alan Mycroft for detecting zero-bytes that is trivially changeable to zero-nibble detection. POSIX function ffsll is then applied to find the bit position of that nibble.
Slightly problematic is the requirement for extraction of a left-most bit group, rather than a right-most one, since Alan Mycroft's trick only works for finding the first zero-nibble from the right. Also, handling the prefix sum for a left-most bit group requires the use of a mulhi operation, which may not be easily available and may be less efficient than standard integer multiplication. I have addressed both of these issues by simply bit-reversing the original operand up front.
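#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <limits.h>  /* CHAR_BIT, used in reference() */
#include <string.h>
#include <strings.h> /* ffsll() is declared here or in <string.h>, depending on the C library */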
/* Reverse bit-duos using classic binary partitioning algorithm */
inline uint32_t rev_bit_duos (uint32_t a)
{
uint32_t m;
a = (a >> 16) | (a << 16); // swap halfwords
m = 0x00ff00ff; a = ((a >> 8) & m) | ((a << 8) & ~m); // swap bytes
m = (m << 4)^m; a = ((a >> 4) & m) | ((a << 4) & ~m); // swap nibbles
m = (m << 2)^m; a = ((a >> 2) & m) | ((a << 2) & ~m); // swap bit-duos
return a;
}
/* Return the number of most significant (leftmost) bits that must be extracted
to achieve an equal count of 1-bits and 0-bits in the extracted bit group.
Return 0 if no such bit group exists.
*/
int solution (uint32_t x)
{
const uint64_t mask16 = 0x0000ffff0000ffffULL; // alternate half-words
const uint64_t mask8 = 0x00ff00ff00ff00ffULL; // alternate bytes
const uint64_t mask4h = 0x0c0c0c0c0c0c0c0cULL; // alternate nibbles, high bit-duo
const uint64_t mask4l = 0x0303030303030303ULL; // alternate nibbles, low bit-duo
const uint64_t nibble_lsb = 0x1111111111111111ULL;
const uint64_t nibble_msb = 0x8888888888888888ULL;
uint64_t a, b, r, s, t, expx, pc_expx, nc_expx;
int res;
/* common path can't handle all 0s and all 1s due to counter overflow */
if ((x == 0) || (x == ~0)) return 0;
/* make zero-nibble detection work, and simplify prefix sum computation */
x = rev_bit_duos (x); // reverse bit-duos
/* expand each bit-duo into a nibble */
expx = x;
expx = ((expx << 16) | expx) & mask16;
expx = ((expx << 8) | expx) & mask8;
expx = ((expx << 4) | expx);
expx = ((expx & mask4h) * 4) + (expx & mask4l);
/* compute positive and negative change counts for each nibble */
pc_expx = expx & ( expx >> 1) & nibble_lsb;
nc_expx = ~expx & (~expx >> 1) & nibble_lsb;
/* produce prefix sums for positive and negative change counters */
a = pc_expx * nibble_lsb;
b = nc_expx * nibble_lsb;
/* subtract positive and negative prefix sums, nibble-wise */
s = a ^ ~b;
r = a | nibble_msb;
t = b & ~nibble_msb;
s = s & nibble_msb;
r = r - t;
r = r ^ s;
/* find first nibble that is zero using Alan Mycroft's magic */
r = (r - nibble_lsb) & (~r & nibble_msb);
res = ffsll (r) / 2; // account for bit-duo to nibble expansion
return res;
}
/* Return the number of most significant (leftmost) bits that must be extracted
to achieve an equal count of 1-bits and 0-bits in the extracted bit group.
Return 0 if no such bit group exists.
*/
int reference (uint32_t x)
{
int count = 0;
int bits = 0;
uint32_t mask = 0x80000000;
do {
bits++;
if (x & mask) {
count++;
} else {
count--;
}
x = x << 1;
} while ((count) && (bits <= (int)(sizeof(x) * CHAR_BIT)));
return (count) ? 0 : bits;
}
int main (void)
{
uint32_t x = 0;
do {
uint32_t ref = reference (x);
uint32_t res = solution (x);
if (res != ref) {
printf ("x=%08x res=%u ref=%u\n\n", x, res, ref);
}
x++;
} while (x);
return EXIT_SUCCESS;
}
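Two of the building blocks used in solution() can be looked at in isolation: the prefix sum over nibble counters obtained with a single multiplication, and the Mycroft-style zero detection adapted to nibbles. The sketch below only restates those two steps; the function names are hypothetical and do not appear in the answer's code.

```c
#include <stdint.h>

/* Treat v as sixteen 4-bit counters. Multiplying by 0x1111...1 adds v shifted
   left by 0, 4, 8, ... bits, so nibble k of the product holds the sum of
   nibbles 0..k of v. This is only valid while no running sum exceeds 15. */
uint64_t nibble_prefix_sums(uint64_t v)
{
    return v * 0x1111111111111111ULL;
}

/* Zero-nibble detection: the result is nonzero exactly when v contains a
   zero nibble, and its lowest set bit is the top bit of the lowest zero
   nibble. Higher nibbles may be flagged as well because of borrows, which
   is why only the first hit from the right is reliable. */
uint64_t zero_nibble_mask(uint64_t v)
{
    return (v - 0x1111111111111111ULL) & ~v & 0x8888888888888888ULL;
}
```

The restriction on the running sums is the reason the all-zeros and all-ones inputs are handled separately in solution().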

A possible solution (for 32-bit integers). I'm not sure if it can be improved / avoid the use of lookup tables. Here x is the input integer.
//Look-up table of 2^16 elements.
//The y-th is associated with the first 2 bytes y of x.
//If the wanted bit is in y, LUT1[y] is minus the position of the bit
//If the wanted bit is not in y, LUT1[y] is the number of ones in excess in y minus 1 (between 0 and 15)
LUT1 = ....
//Look-up table of 16 * 2^16 elements.
//The y-th element is associated to two integers y' and y'' of 4 and 16 bits, respectively.
//y' is the number of excess ones in the first 2 bytes of x, minus 1
//y'' is the last 2 bytes of x. The table contains the answer to return.
LUT2 = ....
if(LUT1[x>>16] < 0)
return -LUT1[x>>16];
return LUT2[ (LUT1[x>>16]<<16) | (x & 0xFFFF) ]
This requires ~1MB for the lookup tables.
The same idea also works using 4 lookup tables (one per byte of x). This requires more operations but brings the memory down to about 12KB.
LUT1 = ... //2^8 elements
LUT2 = ... //8 * 2^8 elements
LUT3 = ... //16 * 2^8 elements
LUT4 = ... //24 * 2^8 elements
y = x>>24
if(LUT1[y] < 0)
return -LUT1[y];
y = (LUT1[y]<<8) | ((x>>16) & 0xFF);
if(LUT2[y] < 0)
return -LUT2[y];
y = (LUT2[y]<<8) | ((x>>8) & 0xFF);
if(LUT3[y] < 0)
return -LUT3[y];
return LUT4[(LUT3[y]<<8) | (x & 0xFF) ];

Divide algorithm for binary number with run time of O(logn)

I need to write, in a simple assembly language (not real assembly syntax), a program that calculates the division of two 16-bit binary numbers, without the remainder, in O(log n), and I wondered if there is an efficient algorithm to do it.
I found some algorithms on the web, but all of them require access to a specific bit in the number, and I can't do that.
The only operations I have are +, -, shift right/left (but only by one position per operation), &, |, ! and apparently that's all.
Thanks,
Eliav

this should work.
does it help you?
C-Style:
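c=1;
result=0;
// find the maximum n for b*2^n <= a
// c=2^n
// (b must be nonzero, and b << 1 must not overflow the register width)
while((b << 1) <= a) {
b = b << 1;
c = c << 1;
}
// subtract b*2^n, b*2^(n-1), ..., b and collect the matching powers of two
while (c > 0) {
if (a >= b) { // same test as a - b >= 0, but also correct for unsigned values
a -= b;
result += c;
}
c = c >> 1;
b = b >> 1;
}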
Assembly-Style:
registers s0,s1,s2,s3,s4, zero
syntax <instr.> <dest>,<src>,<arg>
%result: s0
%a: s1
%b: s2
%c: s3
%scratch: s4
add s0, zero, 0
add s3, zero, 1
loop_1_start:
left_shift s4,s2,1
jump_if_greater s4,s1,loop_1_end
left_shift s2,s2,1
left_shift s3,s3,1
jump loop_1_start
loop_1_end:
loop_2_start:
jump_if_smaller_or_equal s3,zero,loop_2_end
if_start:
sub s4,s1,s2
jump_if_smaller s4, zero, if_end
sub s1,s1,s2
add s0,s0,s3
if_end:
right_shift s2,s2,1
right_shift s3,s3,1
jump loop_2_start
loop_2_end:
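For reference, the same shift-and-subtract scheme can be written as a small C function that is easy to test before translating it to the restricted instruction set. This is a sketch under a few assumptions: the name `div16` is hypothetical, wider intermediate variables are used so that the doubled divisor cannot overflow, and division by zero simply returns 0 here.

```c
#include <stdint.h>

/* Shift-and-subtract division of two 16-bit numbers, quotient only.
   Uses only comparison, addition, subtraction and shifts by one. */
uint16_t div16(uint16_t a, uint16_t b)
{
    uint32_t rem = a;      /* running remainder */
    uint32_t d = b;        /* divisor scaled by c */
    uint32_t c = 1;        /* current power of two */
    uint32_t result = 0;

    if (b == 0)
        return 0;          /* convention chosen for this sketch only */

    /* find the largest c = 2^n with b * 2^n <= a */
    while ((d << 1) <= rem) {
        d <<= 1;
        c <<= 1;
    }
    /* walk back down, subtracting whenever the scaled divisor still fits */
    while (c > 0) {
        if (rem >= d) {
            rem -= d;
            result += c;
        }
        c >>= 1;
        d >>= 1;
    }
    return (uint16_t)result;
}
```

Each loop runs at most 16 times for 16-bit inputs, so the number of steps grows with the number of bits in the operands rather than with their value.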

ActionScript 3 - What do these codes do?

I'm trying to understand some ActionScript 3 features in order to port some code.
Code 1
How does the "++" influence the index? If idx_val=0, which xvaluer index will be modified?
xvaluer(++idx_val) = "zero";
Code 2
Then I have this: what is the meaning of this part of code?
What is being assigned to bUnicode in the last 3 lines?
(can you explain me the "<<"s and ">>"s)
bUnicode = new Array(2);
i = (i + 1);
i = (i + 1);
bUnicode[0] = aData[(i + 1)] << 2 | aData[(i + 1)] >> 4;
i = (i + 1);
bUnicode[1] = aData[i] << 4 | aData[(i + 1)] >> 2;
Code 3
I haven't the faintest idea of what is happening here.
What is "as" ? What is the "?" ?
bL = c > BASELENGTH ? (INVALID) : (s_bReverseLPad[c]);
Code 4
What is "&&" ?
if ((i + 1) < aData.length && s_bReverseUPad(aData((i + 1))) != INVALID)
Code 5
What is "as" ? What is the "?" ?
n2 = c < 0 ? (c + 256) as (c)
bOut.push(n1 >> 2 & 63)
bOut.push((n1 << 4 | n2 >> 4) & 63)//What is the single "&" ?
bOut.push(n2 << 2 & 63)
Finally, what are the differences between "||" and "|", and between "=" and "==" ?

Code 1: ++i is almost the same thing as i++ or i += 1. The only real difference is that the variable is incremented before the expression is evaluated, so with idx_val=0 the element at index 1 is modified. Read more here.
Code 2: << and >> are bitwise shifts: they shift the bits of the left operand by the number of places given by the right operand. You really need to understand binary before you can mess about with these operators. I would recommend reading this tutorial all the way through.
Code 3: This one is called Ternary Operator and it's actually quite simple. It's a one line if / else statement. bL = c > BASELENGTH ? (INVALID) : (s_bReverseLPad[c]); is equivalent to:
if(c > BASELENGTH) {
bL = INVALID;
} else {
bL = s_bReverseLPad[c];
}
Read more about it here.
Code 4: "The conditional-AND operator (&&) performs a logical-AND of its bool operands, but only evaluates its second operand if necessary." There is also the conditional-OR operator to keep in mind (||).
As an example of the AND operator here is some code:
if(car.fuel && car.wheels) car.move();
Read more about it here.
Code 5: From AS3 Reference: as "Evaluates whether an expression specified by the first operand is a member of the data type specified by the second operand." So basically you're casting one type to another, but only if it's possible otherwise you will get null.
& is Bitwise AND operator and | is Bitwise OR operator, again refer to this article.
= and == are two different operators. The former (=) is called Basic Assignment, meaning it is used when you do any kind of assignment, like: i = 3;. The latter (==) is called Equal to and it is used to check whether a value is equal to something else: if(i == 3) // DO STUFF;. Pretty straightforward.
The only part that doesn't make sense to me is the single question mark. Ternary Operator needs to have both ? and :. Does this code actually run for you? Perhaps a bit more context would help. What type is c?
n2 = c < 0 ? (c + 256) as (c)
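The operators asked about behave the same way in C, so a small C sketch can make Code 1 and Code 5 concrete. Two things here are assumptions: the function names are hypothetical, and the `as` in Code 5 is read as a mangled `:` of the ternary operator, which is only a guess about what the original source said.

```c
/* Code 1: with pre-increment the variable is incremented first and the new
   value is used as the index. Returns the index that was written. */
int store_with_preincrement(int *values, int idx)
{
    values[++idx] = 42;
    return idx;
}

/* Code 5, first line, read as a ternary: maps a signed byte value
   (-128..127) to its unsigned equivalent (0..255). */
int to_unsigned_byte(int c)
{
    return c < 0 ? c + 256 : c;
}

/* Code 5, the push lines: split two bytes into three 6-bit values.
   Shifts bind tighter than & and |, so the parentheses only make the
   original grouping explicit. & 63 keeps the lowest six bits. */
void split_6bit(int n1, int n2, int out[3])
{
    out[0] = (n1 >> 2) & 63;
    out[1] = ((n1 << 4) | (n2 >> 4)) & 63;
    out[2] = (n2 << 2) & 63;
}
```

With n1 = 77 and n2 = 97 (the character codes of "Ma"), the three values are 19, 22 and 4, which are the indices a Base64 encoder would use for those two bytes.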

Which is the best way to check a bit array in Cuda?

I need to launch N threads (in one block)
This is the code; 'e' is a 1024-bit big number. I need to copy it to the GPU and read it bit by bit.
Host code:
unsigned char *__e;
BIGNUM *e = BN_new();
unsigned char exp[128];
// e
i = cudaMalloc( (void**)&__e, 128* sizeof(unsigned char) );
if(i != cudaSuccess)
printf("cudaMalloc __e FAIL! Code: %d\n", i);
BN_bn2bin128B(e, exp); // copy data in exp
for(i=0; i<128; i++)
exp[i] = reverse(exp[i]);
i = cudaMemcpy( __e, exp, 128* sizeof(unsigned char), cudaMemcpyHostToDevice);
if(i != cudaSuccess)
printf("cudaMemcpy __e FAIL! Code: %d\n", i);
unsigned char reverse(unsigned char b) {
b = (b & 0xF0) >> 4 | (b & 0x0F) << 4;
b = (b & 0xCC) >> 2 | (b & 0x33) << 2;
b = (b & 0xAA) >> 1 | (b & 0x55) << 1;
return b;
}
Device code:
for(int i=0; i<1024; i++)
if(ISBITSET(__e, i) == 1)
//do something
Header:
#define ISBITSET(x,i) ((x[i>>3] & (1<<(i&7)))!=0)
Unfortunately ISBITSET doesn't accept anything other than __e, so I can't check the further values in __e itself.
How can I solve it? Or is there a better way?

The GPU is a 32 bit machine, so you'll want to process your 1024 bits 32 bits at a time, not 8. So, you should replace all unsigned char with unsigned int and adjust the values accordingly.
The GPU has a fast PTX instruction for reversing 32 bits at a time, so you may want to implement that on the GPU. The instruction is called brev. To use it, you would add inline PTX, something like this (untested):
asm("brev.b32 %0, %1;" : "=r"(dst_var) : "r"(src_var));
For more information, see NVIDIA's document, "Using Inline PTX Assembly In CUDA".
for(int i=0; i<1024; i++)
if(ISBITSET(__e, i) == 1)
//do something
This code may have a performance problem. Presuming that there is a 50% chance that a bit is on, you only get 50% of the possible performance since half of your threads will have to wait while the other half perform the //do something. I can't think of a workaround though. You may also want to launch threads instead of looping.
Unfortunately ISBITSET doesn't accept anything other than __e, so I can't check the further values in __e itself.
Could you elaborate? The ISBITSET macro looks ok to me and looks like it can process any array of unsigned chars, which is what __e is.
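One possible explanation, offered only as a guess because the failing call isn't shown: the macro parameters are not parenthesized, so an argument such as __e + 4 expands to __e + 4[i>>3], which does not compile. The sketch below combines a fully parenthesized macro with the 32-bit word layout suggested above. `ISBITSET32` and `count_set_bits` are hypothetical names, and the code is plain C so it can be tried on the host first.

```c
#include <stdint.h>

/* Bit i of an array of 32-bit words, least significant bit first within each
   word (the same bit order as the original byte-based macro). Every macro
   parameter is parenthesized so that expressions like `e + 4` are accepted. */
#define ISBITSET32(x, i) ((((x)[(i) >> 5] >> ((i) & 31)) & 1u) != 0)

/* Example use: count the set bits among the first nbits bits. */
int count_set_bits(const uint32_t *words, int nbits)
{
    int count = 0;
    for (int i = 0; i < nbits; i++)
        if (ISBITSET32(words, i))
            count++;
    return count;
}
```

For the 1024-bit exponent this means 32 words instead of 128 bytes.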

How can I reverse the ON bits in a byte?

I was reading Joel's book where he was suggesting as interview question:
Write a program to reverse the "ON" bits in a given byte.
I only can think of a solution using C.
Asking here so you can show me how to do in a Non C way (if possible)

I claim trick question. :) Reversing all bits means a flip-flop, but only the bits that are on clearly means:
return 0;

What specifically does that question mean?
Good question. If reversing the "ON" bits means reversing only the bits that are "ON", then you will always get 0, no matter what the input is. If it means reversing all the bits, i.e. changing all 1s to 0s and all 0s to 1s, which is how I initially read it, then that's just a bitwise NOT, or complement. C-based languages have a complement operator, ~, that does this. For example:
unsigned char b = 102; /* 0x66, 01100110 */
unsigned char reverse = ~b; /* 0x99, 10011001 */

What specifically does that question mean?
Does reverse mean setting 1's to 0's and vice versa?
Or does it mean 00001100 --> 00110000 where you reverse their order in the byte? Or perhaps just reversing the part that is from the first 1 to the last 1? ie. 00110101 --> 00101011?
Assuming it means reversing the bit order in the whole byte, here's an x86 assembler version:
; al is input register
; bl is output register
xor bl, bl ; clear output
; first bit
rcl al, 1 ; rotate al through carry
rcr bl, 1 ; rotate carry into bl
; duplicate above 2-line statements 7 more times for the other bits
This is not the most efficient solution; a table lookup is faster.
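A table lookup can be sketched in C as follows. To keep the table short, this version uses a 16-entry table of reversed nibbles instead of a full 256-entry byte table, which is a design choice of this sketch; `nibble_rev` and `reverse_byte` are hypothetical names.

```c
#include <stdint.h>

/* nibble_rev[n] is the 4-bit value n with its bit order reversed. */
static const uint8_t nibble_rev[16] = {
    0x0, 0x8, 0x4, 0xC, 0x2, 0xA, 0x6, 0xE,
    0x1, 0x9, 0x5, 0xD, 0x3, 0xB, 0x7, 0xF
};

/* Reverse the bit order of a byte: the reversed low nibble becomes the
   high nibble and the reversed high nibble becomes the low nibble. */
uint8_t reverse_byte(uint8_t b)
{
    return (uint8_t)((nibble_rev[b & 0x0F] << 4) | nibble_rev[b >> 4]);
}
```

For example, 00001100 (0x0C) becomes 00110000 (0x30).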

Reversing the order of bits in C#:
byte ReverseByte(byte b)
{
byte r = 0;
for(int i=0; i<8; i++)
{
int mask = 1 << i;
int bit = (b & mask) >> i;
int reversedMask = bit << (7 - i);
r |= (byte)reversedMask;
}
return r;
}
I'm sure there are more clever ways of doing it but in that precise case, the interview question is meant to determine if you know bitwise operations so I guess this solution would work.
In an interview, the interviewer usually wants to know how you find a solution, what your problem-solving skills are, and whether the result is clean or a hack. So don't come up with too clever a solution, because that will probably mean you found it somewhere on the Internet beforehand. Don't pretend that you don't know it either, and that you just came up with the answer because you are a genius; it will be even worse if she figures it out, since you are basically lying.

If you're talking about switching 1's to 0's and 0's to 1's, using Ruby:
n = 0b11001100
~n & 0xFF # mask to 8 bits; a plain ~n gives -205 for this Integer
If you mean reverse the order:
n = 0b11001100
eval("0b" + n.to_s(2).reverse)
If you mean counting the on bits, as mentioned by another user:
n = 123
count = 0
0.upto(7) { |i| count = count + n[i] }
♥ Ruby

I'm probably misremembering, but I thought that Joel's question was about counting the "on" bits rather than reversing them.
Here you go:
#include <stdio.h>
int countBits(unsigned char byte);
int main(){
FILE* out = fopen( "bitcount.c" ,"w");
int i;
fprintf(out, "#include <stdio.h>\n#include <stdlib.h>\n#include <time.h>\n\n");
fprintf(out, "int bitcount[256] = {");
for(i=0;i<256;i++){
fprintf(out, "%i", countBits((unsigned char)i));
if( i < 255 ) fprintf(out, ", ");
}
fprintf(out, "};\n\n");
fprintf(out, "int main(){\n");
fprintf(out, "srand ( time(NULL) );\n");
fprintf(out, "\tint num = rand() %% 256;\n");
fprintf(out, "\tprintf(\"The byte %%i has %%i bits set to ON.\\n\", num, bitcount[num]);\n");
fprintf(out, "\treturn 0;\n");
fprintf(out, "}\n");
fclose(out);
return 0;
}
int countBits(unsigned char byte){
unsigned char mask = 1;
int count = 0;
while(mask){
if( mask&byte ) count++;
mask <<= 1;
}
return count;
}

The classic Bit Hacks page has several (really very clever) ways to do this, but it's all in C. Any language derived from C syntax (notably Java) will likely have similar methods. I'm sure we'll get some Haskell versions in this thread ;)

byte ReverseByte(byte b)
{
return (byte)(b ^ 0xff);
}
That works if ^ is XOR in your language, but not if it's AND, which it often is.

And here's a version directly cut and pasted from OpenJDK, which is interesting because it involves no loop. On the other hand, unlike the Scheme version I posted, this version only works for 32-bit and 64-bit numbers. :-)
32-bit version:
public static int reverse(int i) {
// HD, Figure 7-1
i = (i & 0x55555555) << 1 | (i >>> 1) & 0x55555555;
i = (i & 0x33333333) << 2 | (i >>> 2) & 0x33333333;
i = (i & 0x0f0f0f0f) << 4 | (i >>> 4) & 0x0f0f0f0f;
i = (i << 24) | ((i & 0xff00) << 8) |
((i >>> 8) & 0xff00) | (i >>> 24);
return i;
}
64-bit version:
public static long reverse(long i) {
// HD, Figure 7-1
i = (i & 0x5555555555555555L) << 1 | (i >>> 1) & 0x5555555555555555L;
i = (i & 0x3333333333333333L) << 2 | (i >>> 2) & 0x3333333333333333L;
i = (i & 0x0f0f0f0f0f0f0f0fL) << 4 | (i >>> 4) & 0x0f0f0f0f0f0f0f0fL;
i = (i & 0x00ff00ff00ff00ffL) << 8 | (i >>> 8) & 0x00ff00ff00ff00ffL;
i = (i << 48) | ((i & 0xffff0000L) << 16) |
((i >>> 16) & 0xffff0000L) | (i >>> 48);
return i;
}

pseudo code..
while (Read())
Write(0);

I'm probably misremembering, but I thought that Joel's question was about counting the "on" bits rather than reversing them.

Here's the obligatory Haskell solution for complementing the bits; it uses the library function complement:
import Data.Bits
import Data.Int
i = 123::Int
i32 = 123::Int32
i64 = 123::Int64
var2 = 123::Integer
test1 = sho i
test2 = sho i32
test3 = sho i64
test4 = sho var2 -- Exception
sho i = putStrLn $ showBits i ++ "\n" ++ (showBits $complement i)
showBits v = concatMap f (showBits2 v) where
f False = "0"
f True = "1"
showBits2 v = map (testBit v) [0..(bitSize v - 1)]

If the question means to flip all the bits, and you aren't allowed to use C-like operators such as XOR and NOT, then this will work:
bFlipped = -1 - bInput;
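The reason this works is that in two's complement ~x equals -x - 1, so -1 - x flips every bit. A minimal C check of that identity for byte values, with hypothetical function names:

```c
#include <stdint.h>

/* Flip all bits of a byte using subtraction only. */
uint8_t flip_by_subtraction(uint8_t b)
{
    return (uint8_t)(-1 - b);
}

/* The same result using the complement operator, for comparison. */
uint8_t flip_by_not(uint8_t b)
{
    return (uint8_t)~b;
}
```

For 0x66 (01100110) both functions give 0x99 (10011001).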

I'd modify palmsey's second example, eliminating a bug and eliminating the eval:
n = 0b11001100
n.to_s(2).rjust(8, '0').reverse.to_i(2)
The rjust is important if the number to be bitwise-reversed is a fixed-length bit field -- without it, the reverse of 0b00101010 would be 0b10101 rather than the correct 0b01010100. (Obviously, the 8 should be replaced with the length in question.) I just got tripped up by this one.

Asking here so you can show me how to do in a Non C way (if possible)
Say you have the number 10101010. To change 1s to 0s (and vice versa) you just use XOR:
10101010
^11111111
--------
01010101
Doing it by hand is about as "Non C" as you'll get.
However from the wording of the question it really sounds like it's only turning off "ON" bits... In which case the answer is zero (as has already been mentioned) (unless of course the question is actually asking to swap the order of the bits).

Since the question asked for a non-C way, here's a Scheme implementation, cheerfully plagiarised from SLIB:
(define (bit-reverse k n)
(do ((m (if (negative? n) (lognot n) n) (arithmetic-shift m -1))
(k (+ -1 k) (+ -1 k))
(rvs 0 (logior (arithmetic-shift rvs 1) (logand 1 m))))
((negative? k) (if (negative? n) (lognot rvs) rvs))))
(define (reverse-bit-field n start end)
(define width (- end start))
(let ((mask (lognot (ash -1 width))))
(define zn (logand mask (arithmetic-shift n (- start))))
(logior (arithmetic-shift (bit-reverse width zn) start)
(logand (lognot (ash mask start)) n))))
Rewritten as C (for people unfamiliar with Scheme), it'd look something like this (with the understanding that in Scheme, numbers can be arbitrarily big):
int
bit_reverse(int k, int n)
{
int m = n < 0 ? ~n : n;
int rvs = 0;
while (--k >= 0) {
rvs = (rvs << 1) | (m & 1);
m >>= 1;
}
return n < 0 ? ~rvs : rvs;
}
int
reverse_bit_field(int n, int start, int end)
{
int width = end - start;
int mask = ~(-1 << width);
int zn = mask & (n >> start);
return (bit_reverse(width, zn) << start) | (~(mask << start) & n);
}

Reversing the bits.
For example, we have a number represented by 01101011. If we reverse the bits, this number becomes 11010110. To achieve this you should first know how to swap two bits in a number.
Swapping two bits in a number:
XOR the two bits with each other. If the result is 0, the bits are the same and nothing needs to change; otherwise flip both bits by XORing each of them with 1 and save the result back in the original number.
Now for reversing the number
FOR I less than Numberofbits/2
swap(Number,I,NumberOfBits-1-I);
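The two steps above can be sketched in C for an 8-bit number as follows. The names `swap_bits` and `reverse_bits` are hypothetical, and the swap is done without a branch: the XOR of the two bits is used directly as the flip mask.

```c
#include <stdint.h>

/* Swap bits i and j of n. If the two bits are equal, nothing changes;
   if they differ, flipping both of them swaps them. */
uint8_t swap_bits(uint8_t n, int i, int j)
{
    unsigned diff = ((n >> i) ^ (n >> j)) & 1u;  /* 1 if the bits differ */
    return (uint8_t)(n ^ ((diff << i) | (diff << j)));
}

/* Reverse the bit order by swapping bit i with bit (7 - i) for the
   lower half of the positions. */
uint8_t reverse_bits(uint8_t n)
{
    for (int i = 0; i < 8 / 2; i++)
        n = swap_bits(n, i, 8 - 1 - i);
    return n;
}
```

With the example above, reverse_bits(0x6B) (01101011) yields 0xD6 (11010110).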
