Join bits using bitwise operations. Is this even possible? - language-agnostic

What I am asking is whether it is possible to join all the bits of 2 different numbers.
A pseudo-code example:
bytes=array(0x04, 0x3F);
//place bitwise black magic here
print 0x043F;
Another example:
bytes=array(0xFF, 0xFFFF);
//place bitwise black magic here
print 0xFFFFFF;
Yet another example:
bytes=array(0x34F3, 0x54FD);
//place bitwise black magic here
print 0x34F354FD;
I want to restrict this to bitwise operators only (>>, <<, |, ^, ~ and &).
This should work at least in PHP and JavaScript.
Is this possible in ANY way?
If I'm not being clear, please ask your doubts in a comment.

If I understand your question correctly, this should be the answer in PHP (each hex digit is 4 bits, so the first value has to be shifted left by 4 bits per hex digit of the second value):
$temp = $first_value << (4 * strlen(dechex($second_value)));
$result = $temp | $second_value;
print dechex($result);
Update: instead of + use the | operator.
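For comparison, here is a small C-style sketch of the same idea (joinHex and its digit-counting loop are my own illustration, not from the answer above):
#include <cstdio>

// Join a and b so that the hex digits of b are appended to those of a.
unsigned long long joinHex(unsigned long long a, unsigned long long b) {
    int digits = 0;
    for (unsigned long long t = b; t; t >>= 4) digits++; // count hex digits of b
    if (digits == 0) digits = 1; // b == 0 still prints as the single digit "0"
    return (a << (4 * digits)) | b;
}

int main() {
    printf("%llX\n", joinHex(0x04, 0x3F));     // 43F
    printf("%llX\n", joinHex(0xFF, 0xFFFF));   // FFFFFF
    printf("%llX\n", joinHex(0x34F3, 0x54FD)); // 34F354FD
    return 0;
}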

This problem hinges completely on being able to determine the position of the leftmost 1 in an integer. One way to do that is by "smearing the bits right" and then counting the 1's:
Smearing to the right:
int smearright(int x) {
x |= x >> 1;
x |= x >> 2;
x |= x >> 4;
x |= x >> 8;
x |= x >> 16;
return x;
}
Easy, only bitwise operators there. For example, smearright(0x0C) = 0x0F: every bit below the highest set bit becomes 1. Counting the bits, however, involves some sort of addition:
int popcnt(int x) {
x = add(x & 0x55555555, (x >> 1) & 0x55555555);
x = add(x & 0x33333333, (x >> 2) & 0x33333333);
x = add(x & 0x0f0f0f0f, (x >> 4) & 0x0f0f0f0f);
x = add(x & 0x00ff00ff, (x >> 8) & 0x00ff00ff);
x = add(x & 0xffff, (x >> 16) & 0xffff);
return x;
}
But that's OK, add can be implemented as
int add(int x, int y) {
int p = x ^ y;
int g = x & y;
g |= p & (g << 1);
p &= p << 1;
g |= p & (g << 2);
p &= p << 2;
g |= p & (g << 4);
p &= p << 4;
g |= p & (g << 8);
p &= p << 8;
g |= p & (g << 16);
return x ^ y ^ (g << 1);
}
Putting it together:
join = (left << popcnt(smearright(right))) | right;
It's obviously much easier if you allow addition (no add function needed); perhaps surprisingly, it's even simpler than that with multiplication:
join = (left * (smearright(right) + 1)) | right;
No more popcnt at all!
Implementing multiplication in terms of bitwise operators wouldn't help: that's much worse, and I'm not sure you can even do it with the listed operators (unless the right shift is an arithmetic shift, and even then it's a terrible thing involving 32 additions, each of which is a function itself).
There were no "sneaky tricks" in this answer, such as using conditions that implicitly test for equality with zero ("hidden" != 0 in an if, ?:, while etc), and the control flow is actually completely linear (function calls are just there to prevent repeated code, everything can be inlined).
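As a quick sanity check, here is a minimal harness (assuming 32-bit int and the smearright/popcnt/add functions above). Note that both variants join at the highest set bit of right, so the question's second example (0xFF, 0xFFFF) round-trips exactly, while (0x04, 0x3F) would give 0x13F rather than 0x43F:
#include <cstdio>

// assumes smearright() and popcnt() (and its add helper) as defined above
int main() {
    int left = 0xFF, right = 0xFFFF;                      // the question's second example
    int j1 = (left << popcnt(smearright(right))) | right; // bitwise-only variant
    int j2 = (left * (smearright(right) + 1)) | right;    // multiplication variant
    printf("%X %X\n", j1, j2);                            // prints: FFFFFF FFFFFF
    return 0;
}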
Here's an alternative. Instead of taking the popcnt, do a weird variable shift:
int shift_by_mask(int x, int mask) {
    // Shifts x left once for every set bit in the mask (the mask comes from
    // smearright, so its set bits are contiguous at the bottom).
    // Fully unrolled, 32 steps: no loops, no comparisons.
    #define STEP (x <<= mask & 1, mask >>= 1)
    STEP; STEP; STEP; STEP; STEP; STEP; STEP; STEP;
    STEP; STEP; STEP; STEP; STEP; STEP; STEP; STEP;
    STEP; STEP; STEP; STEP; STEP; STEP; STEP; STEP;
    STEP; STEP; STEP; STEP; STEP; STEP; STEP; STEP;
    #undef STEP
    return x;
}
Ok that doesn't make me happy, but here's how you'd use it:
join = shift_by_mask(left, smearright(right)) | right;

Depending on the endianness of your machine, you might have to reverse the order of bytes[0] and bytes[1] below:
uint8_t bytes[2] = { 0x04, 0x3f };
uint16_t result = (bytes[0] << 8) | bytes[1];
(This is in C, shouldn't be hard to translate to PHP etc., the languages and operators are similar enough)
Update:
OK, now that you've clarified what you want, the basic approach is still the same. What you can do instead is count the number of bits in the right number, then do the bitshift as above on the left number, just with the dynamic number of bits. This works as long as you don't have more bits than fit into the largest numeric type that your language/platform supports, so in this example 64 bits.
int rightMaxBits = 0;
uint64_t leftNum = 0x04, rightNum = 0x3f;
uint64_t rightNumCopy = rightNum;
while( rightNumCopy )
{
rightNumCopy >>= 1;
rightMaxBits++;
}
uint64_t resultNum = (leftNum << rightMaxBits) | rightNum;
(Thanks to this SO thread for the bit-counting algo.) For signed numbers, I'd suggest you use abs() on the numbers before you call this, and then re-apply the sign afterwards in whatever way you want.

Related

Implementing Dijkstra's algorithm with C++ STL

I have implemented Dijkstra's algorithm as follows:
#include <iostream>
#include <bits/stdc++.h>
#include<cstdio>
#define ll long long int
#define mod 1000000007
#define pi 3.141592653589793
#define f first
#define s second
#define pb push_back
#define pf push_front
#define pob pop_back
#define pof pop_front
#define vfor(e, a) for (vector<ll> :: iterator e = a.begin(); e != a.end(); e++)
#define vfind(a, e) find(a.begin(), a.end(), e)
#define forr(i, n) for (ll i = 0; i < n; i++)
#define rfor(i, n) for (ll i = n - 1; i >= 0; i--)
#define fors(i, b, e, steps) for(ll i = b; i < e; i += steps)
#define rfors(i, e, b, steps) for(ll i = e; i > b; i -= steps)
#define mp make_pair
using namespace std;
void up(pair<ll, ll> a[], ll n, ll i, ll indArray[]) {
ll ind = (i - 1) / 2;
while (ind >= 0 && a[ind].s > a[i].s) {
swap(a[ind], a[i]);
indArray[a[ind].f] = ind;
indArray[a[i].f] = i;
i = ind;
ind = (i - 1) / 2;
}
}
void down(pair<ll, ll> a[], ll n, ll i, ll indArray[]) {
ll left = 2 * i + 1;
ll right = 2 * i + 2;
ll m = a[i].s;
ll ind = i;
if (left < n && a[left].s < m) {
ind = left;
m = a[left].s;
}
if (right < n && a[right].s < m) {
ind = right;
}
if (ind != i) {
swap(a[i], a[ind]);
indArray[a[i].f] = i;
indArray[a[ind].f] = ind;
}
}
int main() {
ios_base::sync_with_stdio(false);
cin.tie(NULL);
cout.tie(NULL);
// cout << setprecision(10);
ll n, m;
cin >> n >> m;
vector<pair<ll, ll>> a[n];
forr(i, m) {
ll u, v, w;
cin >> u >> v >> w;
a[u].pb(mp(v, w));
a[v].pb(mp(u, w));
}
ll parent[n];
parent[0] = -1;
pair<ll, ll> dist[n];
forr(i, n) {
dist[i] = mp(i, INT_MAX);
}
dist[0].s = 0;
ll ind[n];
iota(ind, ind + n, 0);
ll ans[n];
ans[0] = 0;
bool visited[n];
fill(visited, visited + n, false);
ll size = n;
forr(i, n) {
ll u = dist[0].f;
visited[u] = true;
ll d1 = dist[0].s;
ans[u] = dist[0].s;
swap(dist[0], dist[size - 1]);
size--;
down(dist, size, 0, ind);
for (auto e : a[u]) {
if (visited[e.f]){
continue;
}
ll v = e.f;
ll j = ind[v];
if (dist[j].s > d1 + e.s) {
dist[j].s = d1 + e.s;
up(dist, size, j, ind);
parent[v] = u;
}
}
}
stack<ll> st;
forr(i, n) {
ll j = i;
while (j != -1) {
st.push(j);
j = parent[j];
}
while (!st.empty()) {
cout << st.top() << "->";
st.pop();
}
cout << " Path length is " << ans[i];
cout << '\n';
}
}
This implementation is correct and gives the correct output.
As can be seen, each time I select the node with the lowest key value (distance from source) and then update the keys of all the adjacent nodes of the selected node. After updating the keys of the adjacent nodes I call the 'up' function to maintain the min-heap property. But a priority queue is available in the C++ STL. How can I use it to avoid the functions up and down?
The thing is, I need to be able to find the index of the node-key pair in the min heap whose key needs to be updated. Here in this code I have used a separate ind array which is updated every time the min heap is updated.
But how do I make use of the C++ STL?
As you implied, we cannot random-access efficiently with std::priority_queue. For this case I would suggest that you use std::set. It is not actually a heap but a balanced binary search tree; however, it works the way you want. The find, insert and erase methods are all O(log n), so you can insert/erase/update a value in the desired time, since an update can be done as erase-then-insert. And accessing the minimum is O(1).
You may refer to this reference implementation, which works exactly the way I described. With your adjacency list, the time complexity is O(E log V), where E is the number of edges and V the number of vertices.
And please note that:
With the default comparator, the std::set::begin() method returns the min element if the set is non-empty.
In this code, the distance is put first and the index second, so the set elements are sorted by distance in ascending order.
I did not look into the implementation of up and down in your code in detail.
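For illustration, here is a minimal sketch of the erase-then-insert update (the relax helper, its names and signature are my own, not taken from the reference implementation):
#include <set>
#include <vector>
#include <utility>
using namespace std;
typedef long long ll;

// dist[v] holds the current best distance to v. The set stores (distance, vertex)
// pairs sorted ascending, so *st.begin() is always the closest unsettled vertex.
void relax(set<pair<ll, ll>>& st, vector<ll>& dist, ll v, ll candidate) {
    if (candidate < dist[v]) {
        st.erase(make_pair(dist[v], v));  // O(log n): remove the stale entry
        dist[v] = candidate;              // decrease the key
        st.insert(make_pair(dist[v], v)); // O(log n): re-insert with the new key
    }
}
The main loop then takes *st.begin() as the next settled vertex and erases it; no separate ind array is needed, because the set can locate an entry by its value.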

How do I find the minimum number of bits for unsigned magnitude and 2's complement?

I am slightly confused about finding the minimum number of bits for an unsigned magnitude and 2's complement.
This has been my reasoning so far:
For example,
a) 243 decimal
Since 2^8 = 256, unsigned and 2's complement would both need a minimum of 8 bits.
b) -56 decimal
This is impossible for unsigned.
2^6 = 64. One more bit is needed to show it is negative, so minimum 7 bits.
Is my reasoning correct?
The "bits needed" for unsigned is just the position of the most significant bit (+1, depending on how you define MSB); for two's complement you can negate the value and subtract one (~x = -x - 1) to make it non-negative, then add one more bit for the sign.
#include <stdio.h>

int LeadingZeroCount(long value) {
// http://en.wikipedia.org/wiki/Hamming_weight
// (assumes long is 64 bits, e.g. on LP64 platforms)
unsigned long x = value;
x |= (x >> 1); x |= (x >> 2); x |= (x >> 4);
x |= (x >> 8); x |= (x >> 16); x |= (x >> 32);
x -= (x >> 1) & 0x5555555555555555;
x = (x & 0x3333333333333333) + ((x >> 2) & 0x3333333333333333);
x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F;
x += x >> 8; x += x >> 16; x += x >> 32;
return (sizeof(value) << 3) - (x & 0x7F);
}
int MostSignificantBit(long value) {
return (sizeof(value) << 3) - LeadingZeroCount(value);
}
int BitsNeededUnsigned(unsigned long value) {
return MostSignificantBit(value);
}
int BitsNeededTwosComplement(long value) {
if (value < 0)
return BitsNeededUnsigned(-value - 1) + 1;
else
return BitsNeededUnsigned(value);
}
int main() {
printf("%d\n", BitsNeededUnsigned(243));
printf("%d\n", BitsNeededTwosComplement(243));
printf("%d\n", BitsNeededTwosComplement(-56));
return 0;
}
That's based on your definition of the problem, at least. To me it seems like +243 would need 9 bits for two's complement since the 0 for the sign bit is still relevant.
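If you want to check that numerically, here is a small sketch (my own helper, not part of the answer above); it uses the identity ~v = -v - 1 for negatives, matching the -value - 1 in BitsNeededTwosComplement:
#include <stdio.h>

// Bits needed so that v fits in the two's-complement range [-2^(n-1), 2^(n-1) - 1].
int BitsNeededTC(long long v) {
    unsigned long long mag = v < 0 ? ~(unsigned long long)v : (unsigned long long)v;
    int n = 1;                      // the sign bit
    while (mag) { mag >>= 1; n++; } // one bit per magnitude bit
    return n;
}

int main(void) {
    printf("%d %d %d\n", BitsNeededTC(243), BitsNeededTC(-56), BitsNeededTC(-64));
    // prints: 9 7 7  (243 needs 9 bits; -56 and -64 both fit in 7)
    return 0;
}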

Blending Function/Bezier

Am I calculating the Bezier blend wrong? Any help would be appreciated.
Thank you very much.
double bezierBlend(int i, double u, int m) {
double blend = 1;
blend = factorial(m) * pow(u, i) * pow(1 - u, (m - i)) / (factorial(i) * factorial(m - i));
return blend;
}
Here's a sample to compute the Bezier blend function, following directly from the formulation:
double choose( long n, long k )
{
long j;
double a;
a = 1;
for (j = k + 1; j <= n; j++)
a *= j;
for (j = 1; j <= n - k; j++)
a /= j;
return a;
};
double bezierBlend( int i, double t, int n )
{
return choose( n, i ) * pow(1 - t, n - i) * pow( t, i );
}
For most applications though, computing the powers and the binomial coefficients each time is absurdly inefficient. In typical applications, the degree of the curve is constant (e.g., 2 for quadratic or 3 for cubic), and you can compute the function much more efficiently by pre-expanding the formula. Here's an example for cubic curves:
double BezCoef(int i, double t)
{
double tmp = 1-t;
switch (i)
{
case 0: return tmp*tmp*tmp;
case 1: return 3*tmp*tmp*t;
case 2: return 3*tmp*t*t;
case 3: return t*t*t;
}
return 0; // not reached
}
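To show how these blends are typically consumed, here is a hypothetical cubic-curve evaluator built on BezCoef above (the Pt struct and bezierPoint are my own illustration):
// Evaluate a cubic Bezier curve at parameter t in [0,1] as the
// blend-weighted sum of its four control points.
struct Pt { double x, y; };

Pt bezierPoint(const Pt ctrl[4], double t)
{
    Pt p = {0.0, 0.0};
    for (int i = 0; i < 4; i++) {
        double b = BezCoef(i, t); // Bernstein weight of control point i
        p.x += b * ctrl[i].x;
        p.y += b * ctrl[i].y;
    }
    return p;
}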

Given an integer, how do I find the next largest power of two using bit-twiddling?

If I have an integer n, how can I find the next number k > n such that k = 2^i, for some natural number i, by bitwise shifting or logic?
Example: If I have n = 123, how can I find k = 128, which is a power of two, and not 124, which is merely divisible by two? This should be simple, but it eludes me.
For 32-bit integers, this is a simple and straightforward route:
unsigned int n;
n--;
n |= n >> 1; // Divide by 2^k for consecutive doublings of k up to 32,
n |= n >> 2; // and then or the results.
n |= n >> 4;
n |= n >> 8;
n |= n >> 16;
n++; // The result is a number of 1 bits equal to the number
// of bits in the original number, plus 1. That's the
// next highest power of 2.
Here's a more concrete example. Let's take the number 221, which is 11011101 in binary:
n--; // 1101 1101 --> 1101 1100
n |= n >> 1; // 1101 1100 | 0110 1110 = 1111 1110
n |= n >> 2; // 1111 1110 | 0011 1111 = 1111 1111
n |= n >> 4; // ...
n |= n >> 8;
n |= n >> 16; // 1111 1111 | 1111 1111 = 1111 1111
n++; // 1111 1111 --> 1 0000 0000
There's one bit in the ninth position, which represents 2^8, or 256, which is indeed the next largest power of 2. Each of the shifts overlaps all of the existing 1 bits in the number with some of the previously untouched zeroes, eventually producing a number of 1 bits equal to the number of bits in the original number. Adding one to that value produces a new power of 2.
Another example; we'll use 131, which is 10000011 in binary:
n--; // 1000 0011 --> 1000 0010
n |= n >> 1; // 1000 0010 | 0100 0001 = 1100 0011
n |= n >> 2; // 1100 0011 | 0011 0000 = 1111 0011
n |= n >> 4; // 1111 0011 | 0000 1111 = 1111 1111
n |= n >> 8; // ... (At this point all bits are 1, so further bitwise-or
n |= n >> 16; // operations produce no effect.)
n++; // 1111 1111 --> 1 0000 0000
And indeed, 256 is the next highest power of 2 from 131.
If the number of bits used to represent the integer is itself a power of 2, you can continue to extend this technique efficiently and indefinitely (for example, add a n >> 32 line for 64-bit integers).
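For example, a 64-bit variant of the same technique might look like this (a sketch; like the snippet above, the initial decrement makes exact powers of two map to themselves):
#include <cstdint>

// Round n up to the next power of two (returns n if it is already one).
// Caveat: wraps to 0 for n == 0 and for n > 2^63.
uint64_t next_pow2_64(uint64_t n) {
    n--;
    n |= n >> 1;
    n |= n >> 2;
    n |= n >> 4;
    n |= n >> 8;
    n |= n >> 16;
    n |= n >> 32; // the extra line needed for 64-bit integers
    return n + 1;
}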
There is actually an assembly solution for this (since the 80386 instruction set).
You can use the BSR (Bit Scan Reverse) instruction to scan for the most significant bit in your integer.
bsr scans the bits, starting at the most significant bit, in the doubleword operand or the second word. If the bits are all zero, ZF is cleared. Otherwise, ZF is set and the bit index of the first set bit found, while scanning in the reverse direction, is loaded into the destination register.
(Extracted from: http://dlc.sun.com/pdf/802-1948/802-1948.pdf)
And then increment the result by 1, so:
bsr ecx, eax //eax = number
jz #zero
mov eax, 2 // set the second bit (instead of an inc ecx)
shl eax, ecx // and move it ecx times to the left
ret // result is in eax
#zero:
xor eax, eax
ret
In newer CPUs you can use the much faster lzcnt instruction (encoded as rep bsr). lzcnt does its job in a single cycle.
A more mathematical way, without loops:
public static int ByLogs(int n)
{
double y = Math.Floor(Math.Log(n, 2));
return (int)Math.Pow(2, y + 1);
}
Here's a logic answer:
function getK(int n)
{
int k = 1;
while (k < n)
k *= 2;
return k;
}
Here's John Feminella's answer implemented as a loop so it can handle Python's long integers:
def next_power_of_2(n):
"""
Return next power of 2 greater than or equal to n
"""
n -= 1 # greater than OR EQUAL TO n
shift = 1
while (n+1) & n: # n+1 is not a power of 2 yet
n |= n >> shift
shift <<= 1
return n + 1
It also returns faster if n is already a power of 2.
For Python 2.7+, this is simpler and faster for most N:
def next_power_of_2(n):
"""
Return next power of 2 greater than or equal to n
"""
return 2**(n-1).bit_length()
This answer uses constexpr so that, when the argument is a compile-time constant, no computation happens at runtime.
Greater than / Greater than or equal to
The following snippets are for the next number k > n such that k = 2^i
(n=123 => k=128, n=128 => k=256) as specified by OP.
If you want the smallest power of 2 greater than OR equal to n then just replace __builtin_clzll(n) by __builtin_clzll(n-1) in the following snippets.
C++11 using GCC or Clang (64 bits)
#include <cstdint> // uint64_t
constexpr uint64_t nextPowerOfTwo64 (uint64_t n)
{
return 1ULL << (sizeof(uint64_t) * 8 - __builtin_clzll(n));
}
Enhancement using CHAR_BIT as proposed by martinec
#include <cstdint> // uint64_t
#include <climits> // CHAR_BIT
constexpr uint64_t nextPowerOfTwo64 (uint64_t n)
{
return 1ULL << (sizeof(uint64_t) * CHAR_BIT - __builtin_clzll(n));
}
C++17 using GCC or Clang (from 8 to 128 bits)
#include <cstdint>     // uint64_t
#include <climits>     // CHAR_BIT
#include <immintrin.h> // _lzcnt_u64 (for the >64-bit branch)
template <typename T>
constexpr T nextPowerOfTwo64 (T n)
{
T clz = 0;
if constexpr (sizeof(T) * CHAR_BIT <= 32)
clz = __builtin_clz(n) - (32 - sizeof(T) * CHAR_BIT); // unsigned int, adjusted for 8/16-bit T
else if constexpr (sizeof(T) * CHAR_BIT <= 64)
clz = __builtin_clzll(n); // unsigned long long
else { // See https://stackoverflow.com/a/40528716
uint64_t hi = n >> 64;
uint64_t lo = (hi == 0) ? n : -1ULL;
clz = _lzcnt_u64(hi) + _lzcnt_u64(lo);
}
return T{1} << (CHAR_BIT * sizeof(T) - clz);
}
Other compilers
If you use a compiler other than GCC or Clang, please visit the Wikipedia page listing the Count Leading Zeroes bitwise functions:
Visual C++ 2005 => Replace __builtin_clzl() by _BitScanReverse()
Visual C++ 2008 => Replace __builtin_clzl() by __lzcnt()
icc => Replace __builtin_clzl() by _bit_scan_reverse
GHC (Haskell) => Replace __builtin_clzl() by countLeadingZeros()
Contribution welcome
Please propose improvements within the comments. Also propose alternative for the compiler you use, or your programming language...
See also similar answers
nulleight's answer
ydroneaud's answer
Here's a wild one that has no loops, but uses an intermediate float.
// compute k = nextpowerof2(n)
if (n > 1)
{
float f = (float) n;
unsigned int const t = 1U << ((*(unsigned int *)&f >> 23) - 0x7f);
k = t << (t < n);
}
else k = 1;
This, and many other bit-twiddling hacks, including the one submitted by John Feminella, can be found here.
Assume x is not negative.
int pot = Integer.highestOneBit(x);
if (pot != x) {
pot *= 2;
}
If you use GCC, MinGW or Clang:
template <typename T>
T nextPow2(T in)
{
return (in & (T)(in - 1)) ? (1U << (sizeof(T) * 8 - __builtin_clz(in))) : in;
}
If you use Microsoft Visual C++, use the function _BitScanReverse() to replace __builtin_clz().
function Pow2Thing(int n)
{
x = 1;
while (n>0)
{
n/=2;
x*=2;
}
return x;
}
Bit-twiddling, you say?
long int pow_2_ceil(long int t) {
if (t == 0) return 1;
if (t != (t & -t)) {
do {
t -= t & -t;
} while (t != (t & -t));
t <<= 1;
}
return t;
}
Each loop iteration strips the least-significant 1-bit directly, until only the highest bit remains; e.g. 123 = 1111011 → 1111010 → 1111000 → 1110000 → 1100000 → 1000000 = 64, which the final shift doubles to 128. N.B. The t & -t trick only works where signed numbers are encoded in two's complement.
What about something like this:
int pot = 1;
for (int i = 0; i < 31; i++, pot <<= 1)
if (pot >= x)
break;
You just need to find the most significant bit and shift it left once. Here's a Python implementation. I think x86 has an instruction to get the MSB, but here I'm implementing it all in straight Python. Once you have the MSB it's easy.
>>> def msb(n):
... result = -1
... index = 0
... while n:
... bit = 1 << index
... if bit & n:
... result = index
... n &= ~bit
... index += 1
... return result
...
>>> def next_pow(n):
... return 1 << (msb(n) + 1)
...
>>> next_pow(1)
2
>>> next_pow(2)
4
>>> next_pow(3)
4
>>> next_pow(4)
8
>>> next_pow(123)
128
>>> next_pow(222)
256
>>>
Forget this! It uses a loop!
unsigned int nextPowerOf2 ( unsigned int u)
{
unsigned int v = 0x80000000; // assumes 32-bit unsigned int
if (u < v) {
while (v > u) v = v >> 1;
}
return (v << 1); // return 0 if number is too big
}
private static int nextHighestPower(int number){
if((number & number-1)==0){
return number;
}
else{
int count=0;
while(number!=0){
number=number>>1;
count++;
}
return 1<<count;
}
}
// n is the number
int min = (n & -n); // lowest set bit
int nextPowerOfTwo = n + min;
// Caution: only correct when the set bits of n are contiguous
// (e.g. 0b110 -> 0b1000); for n = 5 this gives 6, which is not a power of two.
#define nextPowerOf2(x, n) (x + (n-1)) & ~(n-1)
or even
#define nextPowerOf2(x, n) x + (x & (n-1))
(Note that the first macro rounds x up to a multiple of n, for n a power of two; neither computes the next power of two itself.)

How to get checksums for strided patterns

I have a 64-bit number (but only the 42 low-order bits are used) and need to compute the sum of the 4 bits at n, n+m, n+m*2 and n+m*3 (note: anything that can produce a sum >4 is invalid) for some fixed m and every value of n that keeps all four bit positions inside the number.
As an example, using m=3 and given the 16-bit number
0010 1011 0110 0001
I need to compute
2, 3, 1, 2, 3, 0, 3
Does anyone have any (cool) ideas for ways to do this? I'm fine with bit twiddling.
My current thought is to make bit shifted copies of the input to align the values to be summed and then build a logic tree to do a 4x 1bit adder.
v1 = In;
v2 = In<<3;
v3 = In<<6;
v4 = In<<9;
a1 = v1 ^ v2;
a2 = v1 & v2;
b1 = v3 ^ v4;
b2 = v3 & v4;
c2 = a1 & b1;
d2 = a2 ^ b2;
o1 = a1 ^ b1;
o2 = c2 ^ d2;
o4 = a2 & b2;
This does end up with the bits of the result spread across 3 different ints but oh well.
Edit: as it happens I need the histogram of the sums, so doing a bit-count of o4, o2&o1, o2 and o1 gives me what I want.
A second solution uses a perfect hash function:
arr = [0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4];
for(int i = 0; i < N; i++)
{
out[i] = arr[(In & 0b1001001001) % 30];
In >>= 1;
}
This works by noting that the 4 selected bits can only take on 16 patterns and that (by guess and check) they can be hashed into 0-15 using mod 30. From there, a table of computed values gives the needed sum. As it happens only 3 of the 4 strides I need work this way.
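As a sanity check, here is a tiny harness (my own wrapper around the table and mask above) that reproduces the expected sums for the 16-bit example; it prints from n = 6 down to n = 0 to match the order in the question:
#include <cstdio>

int main() {
    // arr[h] = number of set bits in the 4-bit pattern whose perfect hash is h.
    const int arr[16] = {0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4};
    unsigned in = 0x2B61; // 0010 1011 0110 0001, the example from the question
    int out[7];
    for (int n = 0; n <= 6; n++) {       // n + 9 must stay within the 16 bits
        out[n] = arr[(in & 0x249) % 30]; // 0x249 = 0b1001001001: bits n, n+3, n+6, n+9
        in >>= 1;
    }
    for (int n = 6; n >= 0; n--)
        printf("%d ", out[n]);           // prints: 2 3 1 2 3 0 3
    printf("\n");
    return 0;
}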
p.s.
Correct trumps fast. Fast trumps clear. I expect to be running this millions of time.
Maybe I am crazy, but I am having fun :D
This solution is based on data parallelism, faking a vector CPU without actually using SSE intrinsics or anything similar.
unsigned short out[64];
const unsigned long long mask = 0x0249024902490249ul;
const unsigned long long shiftmask = 0x0001000100010001ul;
unsigned long long t = (unsigned short)(in >> 38) | (unsigned long long)(unsigned short)(in >> 39) << 16 | (unsigned long long)(unsigned short)(in >> 40) << 32 | (unsigned long long)(unsigned short)(in >> 41) << 48;
t &= mask;
*((unsigned long long*)(out + 38)) = (t & shiftmask) + (t >> 3 & shiftmask) + (t >> 6 & shiftmask) + (t >> 9 & shiftmask);
[... snipsnap ...]
t = (unsigned short)(in >> 2) | (unsigned long long)(unsigned short)(in >> 3) << 16 | (unsigned long long)(unsigned short)(in >> 4) << 32 | (unsigned long long)(unsigned short)(in >> 5) << 48;
t &= mask;
*((unsigned long long*)(out + 2)) = (t & shiftmask) + (t >> 3 & shiftmask) + (t >> 6 & shiftmask) + (t >> 9 & shiftmask);
t = (unsigned short)in | (unsigned long long)(unsigned short)(in >> 1) << 16;
t &= mask;
*((unsigned int*)out) = (unsigned int)((t & shiftmask) + (t >> 3 & shiftmask) + (t >> 6 & shiftmask) + (t >> 9 & shiftmask));
By reordering the computations, we can further reduce the execution time significantly, since it drastically reduces the amount of times that something is loaded into the QWORD. A few other optimizations are quite obvious and rather minor, but sum up to another interesting speedup.
unsigned short out[64];
const unsigned long long Xmask = 0x249024902490249ull;
const unsigned long long Ymask = 0x7000700070007u;
unsigned long long x = (in >> 14 & 0xFFFFu) | (in >> 20 & 0xFFFFu) << 16 | (in >> 26 & 0xFFFFu) << 32 | (in >> 32) << 48;
unsigned long long y;
y = x & Xmask;
y += y >> 6;
y += y >> 3;
y &= Ymask;
out[32] = (unsigned short)(y >> 48);
out[26] = (unsigned short)(y >> 32);
out[20] = (unsigned short)(y >> 16);
out[14] = (unsigned short)(y );
x >>= 1;
y = x & Xmask;
y += y >> 6;
y += y >> 3;
y &= Ymask;
out[33] = (unsigned short)(y >> 48);
out[27] = (unsigned short)(y >> 32);
out[21] = (unsigned short)(y >> 16);
out[15] = (unsigned short)(y );
[... snipsnap ...]
x >>= 1;
y = x & Xmask;
y += y >> 6;
y += y >> 3;
y &= Ymask;
out[37] = (unsigned short)(y >> 48);
out[31] = (unsigned short)(y >> 32);
out[25] = (unsigned short)(y >> 16);
out[19] = (unsigned short)(y );
x >>= 1;
x &= 0xFFFF000000000000ul;
x |= (in & 0xFFFFu) | (in >> 5 & 0xFFFFu) << 16 | (in >> 10 & 0xFFFFu) << 32;
y = x & Xmask;
y += y >> 6;
y += y >> 3;
y &= Ymask;
out[38] = (unsigned short)(y >> 48);
out[10] = (unsigned short)(y >> 32);
out[ 5] = (unsigned short)(y >> 16);
out[ 0] = (unsigned short)(y );
[snipsnap]
x >>= 1;
y = x & Xmask;
y += y >> 6;
y += y >> 3;
y &= Ymask;
out[ 9] = (unsigned short)(y >> 16);
out[ 4] = (unsigned short)(y );
Running times for 50 million executions in native C++ (all outputs verified to match ^^), compiled as a 64-bit binary on my PC:
Array based solution: ~5700 ms
Naive hardcoded solution: ~4200 ms
The first solution: ~2400 ms
The second solution: ~1600 ms
A suggestion that I don't want to code right now is to use a loop, an array to hold partial results, and constants to pick up the bits m at a time.
loop
s[3*i] += (x >> 0) & 1;
s[3*i+1] += (x >> 1) & 1;
s[3*i+2] += (x >> 2) & 1;
x >>= 3;
This will pick too many bits in each sum. But you can also keep track of the intermediate results and subtract from the sums as you go, to account for the bit that may not be there anymore.
loop
s[3*i] += p[3*i] = (x >> 0) & 1;
s[3*i+1] += p[3*i+1] = (x >> 1) & 1;
s[3*i+2] += p[3*i+2] = (x >> 2) & 1;
s[3*i] -= p[3*i-10];
s[3*i+1] -= p[3*i-9];
s[3*i+2] -= p[3*i-8];
x >>= 3;
with the appropriate bounds checking, of course.
The fastest approach is to just hardcode the sums themselves.
s[0] = ((x >> 0) & 1) + ((x >> 3) & 1) + ((x >> 6) & 1) + ((x >> 9) & 1);
etc. (The shift amounts are compile-time constants.)