CUDA thrust: how to realize "partition" that supports "stencil"? - cuda

Suppose there is an array of integers:
A[]={2, 2, 9, 8, 5, 7, 0, 6}
and a stencil:
B[]={1, 0, 0, 1, 1, 1, 0, 1}
My question is how to rearrange A[] according to B[] so that whenever B[i]==1 and B[j]==0, A[i] is guaranteed to precede A[j] in the new array, which should look like this:
C[]={2, 8, 5, 7, 6, 2, 9, 0}
PS: I found that the "partition" function is almost the answer, except that it only supports a predicate. Is there any workaround?
Any hint is much appreciated!

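This can be implemented using thrust::stable_sort_by_key(), with the stencil B as the keys (sorted in descending order so that the 1s come first) and A as the values.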

Now that thrust::partition and thrust::stable_partition with stencil have been implemented (one may need to get the source from the official Thrust repository), this can be achieved with:
#include <thrust/partition.h>

struct is_one
{
    __host__ __device__
    bool operator()(const int &x) const
    {
        return x == 1;
    }
};

// Partition the values on the device according to the stencil
// (d_A and d_B are assumed to be thrust::device_vector<int> of equal size)
thrust::stable_partition(d_A.begin(),
                         d_A.end(),
                         d_B.begin(),
                         is_one());
Which leads to:
A = 0 1 2 3 4 5 6 7 8 9
B = 0 1 1 0 0 1 0 0 1 0
C = 1 2 5 8 0 3 4 6 7 9
This implementation is more efficient since we are not sorting the values in the two partitions. A similar and more complex example is available here (with some more details in the answer).
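To make the intended semantics concrete, here is a minimal host-side sketch in plain Python (not Thrust). It models both approaches from the answers: a stable partition driven by a stencil, and a stable sort keyed on the stencil. The function names `stable_partition_by_stencil` and `partition_via_stable_sort` are made up for this illustration.

```python
def stable_partition_by_stencil(values, stencil, pred):
    """Reorder values so that items whose stencil entry satisfies pred come
    first. The relative order inside each group is preserved."""
    selected = [v for v, s in zip(values, stencil) if pred(s)]
    rejected = [v for v, s in zip(values, stencil) if not pred(s)]
    return selected + rejected


def partition_via_stable_sort(values, stencil):
    """Same result obtained with a stable sort keyed on the stencil."""
    # Key 0 for stencil == 1 and key 1 otherwise; Python's sort is stable.
    order = sorted(range(len(values)), key=lambda i: 0 if stencil[i] == 1 else 1)
    return [values[i] for i in order]


A = [2, 2, 9, 8, 5, 7, 0, 6]
B = [1, 0, 0, 1, 1, 1, 0, 1]
C = stable_partition_by_stencil(A, B, lambda s: s == 1)
print(C)  # [2, 8, 5, 7, 6, 2, 9, 0]
```

Both functions give the same ordering. The partition version only splits the data into two groups, whereas the sort version does the work of a full sort, which is the efficiency point made above.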

Related

atomicAdd - CUDA function

I apply the atomicAdd function to add 10 to each array element.
The results are not what I expected.
Could you tell me why the value of list[1] is 12, while I expect 11 = 1 + 10? There are 5 threads in total. The initial array values are
slist[0]=1
slist[1]=2
slist[2]=3
slist[3]=4
slist[4]=5
the results are
list[0]= 1, list[0]= 1
list[0]= 1, list[1]= 12
list[0]= 1, list[2]= 13
list[0]= 1, list[3]= 14
list[0]= 1, list[4]= 15
__global__ void RunAtomicAdd(int* slist, int* val)
{
    int id = threadIdx.x;
    slist[0] = atomicAdd((slist + id), 10);
    printf("list[0]= %d, list[%d]= %d \n", slist[0], id, slist[id]);
}
Note that atomicAdd does not return the updated value; it returns the old value (see "cuda atomicAdd example fails to yield correct output").
So all of your outputs are expected. For slist[0], atomicAdd does update the value, but you immediately overwrite it with the return value of atomicAdd, which is the old value. The other elements slist[id] are not overwritten, so they keep their updated values. Note, however, that every thread also assigns the old value of its own element to slist[0], so these writes race with each other; in your run, the value that ended up in slist[0] was 1.
You may want to have a new array to store the result of atomicAdd.
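The suggestion above can be illustrated with a small model in plain Python. This is not CUDA: the threads are simply simulated one after another, and `atomic_add` and `old_values` are hypothetical names introduced for the sketch. It only shows the "return the old value" behaviour and why a separate output array avoids clobbering slist[0].

```python
def atomic_add(array, index, value):
    """Model of atomicAdd's contract: add in place, return the OLD value."""
    old = array[index]
    array[index] = old + value
    return old


slist = [1, 2, 3, 4, 5]
old_values = [0] * len(slist)      # separate array for the returned values

for tid in range(len(slist)):      # each iteration plays the role of one thread
    old_values[tid] = atomic_add(slist, tid, 10)

print(slist)       # [11, 12, 13, 14, 15]
print(old_values)  # [1, 2, 3, 4, 5]
```

Because every simulated thread writes only to its own slot of `old_values`, no element of `slist` is overwritten after its update.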

Thrust adapting thrust::remove_if so predicate is checking for existence in range [duplicate]

I'm using CUDA and Thrust to perform paired set operations. I would like to retain duplicates, however. For example:
int keys[7] = {1, 1, 1, 3, 4, 5, 5};
int vals[7] = {1, 2, 3, 4, 5, 6, 7};
int comp[2] = {1, 5};
thrust::set_intersection_by_key(keys, keys + 7, comp, comp + 2, vals, rk, rv);
Desired result
rk[1, 1, 1, 5, 5]
rv[1, 2, 3, 6, 7]
Actual Result
rk[1, 5]
rv[5, 7]
I want all of the vals where the corresponding key is contained in comp.
Is there any way to achieve this using thrust, or do I have to write my own kernel or thrust function?
I'm using this function: set_intersection_by_key.
Quoting from the thrust documentation:
The generalization is that if an element appears m times in [keys_first1, keys_last1) and n times in [keys_first2, keys_last2) (where m may be zero), then it appears min(m,n) times in the keys output range
Since comp contains each key only once, n=1 and therefore min(m,1) = 1.
In order to get "all of the vals where the corresponding key is contained in comp", you can use the approach of my answer to a similar problem.
Similarly, the example code performs the following steps:
1. Get the largest element of d_comp. This assumes that d_comp is already sorted.
2. Create a vector d_map of size largest_element+1. Copy 1 to all positions of the entries of d_comp in d_map.
3. Copy all entries from d_vals for which there is a 1 entry in d_map into d_result.
#include <thrust/device_vector.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/functional.h>
#include <thrust/copy.h>
#include <thrust/scatter.h>
#include <iostream>
#include <iterator>

#define PRINTER(name) print(#name, (name))

void print(const char* name, const thrust::device_vector<int>& v)
{
    std::cout << name << ":\t";
    thrust::copy(v.begin(), v.end(), std::ostream_iterator<int>(std::cout, "\t"));
    std::cout << std::endl;
}

int main()
{
    int keys[] = {1, 1, 1, 3, 4, 5, 5};
    int vals[] = {1, 2, 3, 4, 5, 6, 7};
    int comp[] = {1, 5};

    const int size_data = sizeof(keys)/sizeof(keys[0]);
    const int size_comp = sizeof(comp)/sizeof(comp[0]);

    // copy data to GPU
    thrust::device_vector<int> d_keys (keys, keys+size_data);
    thrust::device_vector<int> d_vals (vals, vals+size_data);
    thrust::device_vector<int> d_comp (comp, comp+size_comp);

    PRINTER(d_keys);
    PRINTER(d_vals);
    PRINTER(d_comp);

    // d_comp is assumed to be sorted, so its last element is the largest
    int largest_element = d_comp.back();

    // d_map[k] == 1 if and only if k occurs in d_comp
    thrust::device_vector<int> d_map(largest_element+1);
    thrust::constant_iterator<int> one(1);
    thrust::scatter(one, one+size_comp, d_comp.begin(), d_map.begin());
    PRINTER(d_map);

    // keep every value whose key has a 1 entry in d_map
    thrust::device_vector<int> d_result(size_data);
    using namespace thrust::placeholders;
    int final_size = thrust::copy_if(d_vals.begin(),
                                     d_vals.end(),
                                     thrust::make_permutation_iterator(d_map.begin(), d_keys.begin()),
                                     d_result.begin(),
                                     _1
                                     ) - d_result.begin();
    d_result.resize(final_size);
    PRINTER(d_result);

    return 0;
}
output:
d_keys: 1 1 1 3 4 5 5
d_vals: 1 2 3 4 5 6 7
d_comp: 1 5
d_map: 0 1 0 0 0 1
d_result: 1 2 3 6 7
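The three steps can also be sketched on the host in plain Python, which may help when checking the logic before moving it to the GPU. The function name `filter_by_key_membership` is invented for this illustration, and the explicit bounds check on the key is an addition of the sketch: it covers keys larger than the largest entry of comp, which the example data does not contain.

```python
def filter_by_key_membership(keys, vals, comp):
    """Keep every value whose key occurs in comp, duplicates included."""
    if not comp:
        return []
    largest = max(comp)                  # step 1: largest element of comp
    lookup = [0] * (largest + 1)         # step 2: map with a 1 at each comp entry
    for k in comp:
        lookup[k] = 1
    # step 3: copy the values whose key has a 1 entry in the map
    return [v for k, v in zip(keys, vals) if 0 <= k <= largest and lookup[k]]


keys = [1, 1, 1, 3, 4, 5, 5]
vals = [1, 2, 3, 4, 5, 6, 7]
comp = [1, 5]
print(filter_by_key_membership(keys, vals, comp))  # [1, 2, 3, 6, 7]
```

Like the Thrust version, this assumes non-negative integer keys that are small enough for a dense lookup table to be practical.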

thrust: fill isolate space

I have an array like this:
0 0 0 1 0 0 0 0 5 0 0 3 0 0 0 8 0 0
I want every non-zero element to expand one position at a time until it reaches the territory of another non-zero element. The result looks like this:
1 1 1 1 1 1 5 5 5 5 3 3 3 3 8 8 8 8
Is there any way to do this using thrust?
Yes, here is one possible approach.
For each position in the sequence, compute 2 distances. The first is the distance to the nearest non-zero value in the left direction, and the second is the distance to the nearest non-zero value in the right direction. If the position itself is non-zero, both left and right distances will be computed as zero. Our basic engine for this will be segmented inclusive scans, one computed in the left to right direction (to compute the distance from the left for each zero segment), and the other computed in the reverse direction (to compute the distance from the right for each zero segment). Using your example:
a vector:      0 0 0 1 0 0 0 0 5 0 0 3 0 0 0 8 0 0
a left dist:   ? ? ? 0 1 2 3 4 0 1 2 0 1 2 3 0 1 2
a right dist:  3 2 1 0 4 3 2 1 0 2 1 0 3 2 1 0 ? ?
Note that in each distance computation, we must special-case one end if that end does not happen to begin with a non-zero value (because the distance from that direction is "undefined"). We will special case those ? distances by assigning them large values, the reason for which will become evident in the next step.
We now will create a "map" vector, which, for each output position, allows us to select an element from the original input vector that belongs in that output position. This map vector is computed by taking the lesser of the two computed distances, and adjusting the index either from the left or the right, by that distance:
output index:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
a left dist:   ?  ?  ?  0  1  2  3  4  0  1  2  0  1  2  3  0  1  2
a right dist:  3  2  1  0  4  3  2  1  0  2  1  0  3  2  1  0  ?  ?
map vector:    3  3  3  3  3  3  8  8  8  8 11 11 11 11 15 15 15 15
For the map vector computation, if a left dist > a right dist then we take the output index and add a right dist to it, to produce the map vector element at that position. Otherwise, we take the output index and subtract a left dist from it. Note that the special-case ? entries above should be considered to be "arbitrarily large" for this computation. This is simulated in the code by using a large integer (1<<30).
Once we have the map vector, it's a trivial matter to use it to do a mapped copy from input to output vectors:
a vector:    0  0  0  1  0  0  0  0  5  0  0  3  0  0  0  8  0  0
map vector:  3  3  3  3  3  3  8  8  8  8 11 11 11 11 15 15 15 15
out vector:  1  1  1  1  1  1  5  5  5  5  3  3  3  3  8  8  8  8
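Before reading the Thrust version, it may help to see the same three stages (two distance passes, map vector, mapped copy) as a sequential sketch in plain Python. The names `BIG` and `fill_from_nearest_nonzero` are made up for this illustration, the loops stand in for the segmented scans, and the sketch assumes the input contains at least one non-zero element.

```python
BIG = 1 << 30  # stands in for the "arbitrarily large" ? distances


def fill_from_nearest_nonzero(a):
    n = len(a)
    left = [BIG] * n    # distance to the nearest non-zero on the left
    right = [BIG] * n   # distance to the nearest non-zero on the right

    dist = BIG
    for i in range(n):                   # left-to-right pass
        dist = 0 if a[i] != 0 else min(dist + 1, BIG)
        left[i] = dist

    dist = BIG
    for i in range(n - 1, -1, -1):       # right-to-left pass
        dist = 0 if a[i] != 0 else min(dist + 1, BIG)
        right[i] = dist

    # Map vector: ties go to the left neighbour, as in the tables above.
    mapping = [i + right[i] if left[i] > right[i] else i - left[i]
               for i in range(n)]
    return [a[m] for m in mapping], mapping


a = [0, 0, 0, 1, 0, 0, 0, 0, 5, 0, 0, 3, 0, 0, 0, 8, 0, 0]
out, mapping = fill_from_nearest_nonzero(a)
print(mapping)  # [3, 3, 3, 3, 3, 3, 8, 8, 8, 8, 11, 11, 11, 11, 15, 15, 15, 15]
print(out)      # [1, 1, 1, 1, 1, 1, 5, 5, 5, 5, 3, 3, 3, 3, 8, 8, 8, 8]
```

Changing the comparison to `left[i] >= right[i]` gives ties to the right neighbour instead, which is the other equally valid outcome for a position that is equidistant from two non-zero values.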
Here is a fully worked example:
$ cat t610.cu
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/scan.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/functional.h>
#include <thrust/transform.h>
#include <thrust/sequence.h>
#include <thrust/tuple.h>
#include <iostream>
#include <iterator>
#define IVAL (1<<30)
// used to create input vector for prefix sums (distance vector computation)
struct is_zero {
template <typename T>
__host__ __device__
T operator() (T val) {
return (val) ? 0:1;
}
};
// inc and dec help with special casing of left and right ends
struct inc {
template <typename T>
__host__ __device__
T operator() (T val) {
return val+IVAL;
}
};
struct dec {
template <typename T>
__host__ __device__
T operator() (T val) {
return val-IVAL;
}
};
// this functor is lifted from thrust example code
// and is used to enable segmented scans based on flag delimiters
// BinaryPredicate for the head flag segment representation
// equivalent to thrust::not2(thrust::project2nd<int,int>()));
template <typename HeadFlagType>
struct head_flag_predicate : public thrust::binary_function<HeadFlagType,HeadFlagType,bool>
{
__host__ __device__
bool operator()(HeadFlagType left, HeadFlagType right) const
{
return !right;
}
};
// distance tuple ordering is left (0), then right (1)
struct map_functor
{
template <typename T>
__host__ __device__
int operator() (T dist){
int leftdist = thrust::get<0>(dist);
int rightdist = thrust::get<1>(dist);
int idx = thrust::get<2>(dist);
return (leftdist > rightdist) ? (idx+rightdist):(idx-leftdist);
}
};
int main(){
int h_a[] = { 0, 0, 0, 1, 0, 0, 0, 0, 5, 0, 0, 3, 0, 0, 0, 8, 0, 0 };
int n = sizeof(h_a)/sizeof(h_a[0]);
thrust::device_vector<int> a(h_a, h_a+n);
thrust::device_vector<int> az(n);
thrust::device_vector<int> asl(n);
thrust::device_vector<int> asr(n);
thrust::transform(a.begin(), a.end(), az.begin(), is_zero());
// set up distance from the left vector (asl)
thrust::transform_if(az.begin(), az.begin()+1, a.begin(), az.begin(),inc(), is_zero());
thrust::transform(a.begin(), a.begin()+1, a.begin(), inc());
thrust::inclusive_scan_by_key(a.begin(), a.end(), az.begin(), asl.begin(), head_flag_predicate<int>());
thrust::transform(a.begin(), a.begin()+1, a.begin(), dec());
thrust::transform_if(az.begin(), az.begin()+1, a.begin(), az.begin(), dec(), is_zero());
// set up distance from the right vector (asr)
thrust::device_vector<int> ra(n);
thrust::sequence(ra.begin(), ra.end(), n-1, -1);
thrust::transform_if(az.end()-1, az.end(), a.end()-1, az.end()-1, inc(), is_zero());
thrust::transform(a.end()-1, a.end(), a.end()-1, inc());
thrust::inclusive_scan_by_key(thrust::make_permutation_iterator(a.begin(), ra.begin()), thrust::make_permutation_iterator(a.begin(), ra.end()), thrust::make_permutation_iterator(az.begin(), ra.begin()), thrust::make_permutation_iterator(asr.begin(), ra.begin()), head_flag_predicate<int>());
thrust::transform(a.end()-1, a.end(), a.end()-1, dec());
// create combined map vector
thrust::device_vector<int> map(n);
thrust::counting_iterator<int> idxbegin(0);
thrust::transform(thrust::make_zip_iterator(thrust::make_tuple(asl.begin(), asr.begin(), idxbegin)), thrust::make_zip_iterator(thrust::make_tuple(asl.end(), asr.end(), idxbegin+n)), map.begin(), map_functor());
// use map to create output
thrust::device_vector<int> result(n);
thrust::copy(thrust::make_permutation_iterator(a.begin(), map.begin()), thrust::make_permutation_iterator(a.begin(), map.end()), result.begin());
// display results
std::cout << "Input vector:" << std::endl;
thrust::copy(a.begin(), a.end(), std::ostream_iterator<int>(std::cout, " "));
std::cout << std::endl;
std::cout << "Output vector:" << std::endl;
thrust::copy(result.begin(), result.end(), std::ostream_iterator<int>(std::cout, " "));
std::cout << std::endl;
}
$ nvcc -arch=sm_20 -o t610 t610.cu
$ ./t610
Input vector:
0 0 0 1 0 0 0 0 5 0 0 3 0 0 0 8 0 0
Output vector:
1 1 1 1 1 1 5 5 5 5 3 3 3 3 8 8 8 8
$
Notes:
The above implementation probably has areas that can be improved on, particularly with respect to fusion of operations. However, for understanding purposes, I think fusion makes the code a bit harder to read.
I have really only tested it on the particular example you gave. There may be bugs that you will uncover. My purpose is not to give you a black-box library function that you use but don't understand, but rather to teach you how to write your own code that does what you want.
The "ambiguity" pointed out by JackOLantern is still present in your problem statement. I have obscured it by choosing my map functor behavior to mimic the output you indicated as desired, but simply by creating an equally valid but opposite realization of the map functor (using "if a left dist < a right dist then ..." instead) I can cause the result between 3 and 8 to take the other possible outcome/state. Your comment that "if there is an ambiguity, whoever reaches the position first fill its value to that space" makes no sense to me, unless by that you mean "I don't care which outcome you provide." There is no concept of a particular thread reaching a particular point first. Threads (and blocks) can execute in any order, and this order can change from device to device, and run to run.

Binary divisibility by 10

How can I check whether a binary number is divisible by 10 (decimal) without converting it to another number system?
For example, we have a number:
1010 1011 0100 0001 0000 0100
How can we check that this number is divisible by 10?
First split the number into odd and even bits (I'm calling "even" the
bits corresponding to even powers of 2):
100100110010110000000101101110
 0 1 0 1 0 0 1 0 0 0 1 1 0 1 0    even
1 0 0 1 0 1 1 0 0 0 0 0 1 1 1     odd
Now in each of these, add and subtract the digits alternately, as in
the standard test for divisibility by 11 in decimal (starting with
addition at the right):
100100110010110000000101101110
even: +0-1+0-1+0-0+1-0+0-0+1-1+0-1+0 = -2
odd:  +1-0+0-1+0-1+1-0+0-0+0-0+1-1+1 = 1
Now double the sum of the odd digits and add it to the sum of the even
digits:
2*1 + -2 = 0
If the result is divisible by 5, as in this case, the number itself is
divisible by 5.
Since this number is also divisible by 2 (the rightmost digit being
0), it is divisible by 10.
Link
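The procedure quoted above can be written down as a short Python sketch. It works because 4 is congruent to -1 modulo 5, so the even bits contribute their alternating sum and the odd bits contribute twice theirs. The function names `alternating_sum` and `divisible_by_10` are invented for this illustration, and the input is assumed to be a non-empty string of 0s and 1s (spaces are ignored).

```python
def alternating_sum(digits):
    """digits[0] is added, digits[1] is subtracted, and so on."""
    return sum(d if k % 2 == 0 else -d for k, d in enumerate(digits))


def divisible_by_10(binary):
    # bits[0] is the least significant (rightmost) bit
    bits = [int(c) for c in binary.replace(" ", "")][::-1]
    even = bits[0::2]   # bits for even powers of 2
    odd = bits[1::2]    # bits for odd powers of 2
    total = alternating_sum(even) + 2 * alternating_sum(odd)
    # divisible by 2 (last bit is 0) and by 5 (combined sum is a multiple of 5)
    return bits[0] == 0 and total % 5 == 0


print(divisible_by_10("100100110010110000000101101110"))  # True
print(divisible_by_10("1010 1011 0100 0001 0000 0100"))   # True
```

The second call uses the number from the question, which is 11223300 in decimal.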
If you are talking about computational methods, you can do a divisibility-by-5 test and a divisibility-by-2 test.
The numbers below assume unsigned 32-bit arithmetic, but can easily be extended to larger numbers.
I'll provide some code first, followed by a more textual explanation:
/* multiplicative inverse of 5 modulo 2^32 */
#define INV5 0xcccccccd

unsigned int div5exact(unsigned int n)
{
    /* returns n/5 as long as n is actually divisible by 5,
       because n * (INV5 * 5) == n * 1 (mod 2^32) */
    return n * INV5;
}

unsigned int divides5(unsigned int n)
{
    unsigned int q = div5exact(n);
    if (q <= 0x33333333) /* q*5 < 2^32? */
    {
        /* q*5 doesn't overflow, so n == q*5 */
        return 1;
    }
    else
    {
        /* q*5 overflows, so n != q*5 */
        return 0;
    }
}

int divides2(unsigned int n)
{
    /* easy divisibility by 2 test */
    return (n & 1) == 0;
}

int divides10(unsigned int n)
{
    return divides2(n) && divides5(n);
}

/* fast one-liner: */
#define DIVIDES10(n) ( ((n) & 1) == 0 && ((n) * 0xcccccccd) <= 0x33333333 )
Divisibility by 2 is easy: (n&1) == 0 means that n is even.
Divisibility by 5 involves multiplying by the inverse of 5, which is 0xcccccccd (because 0xcccccccd * 5 == 0x400000001, which is just 0x1 if you truncate to 32 bits).
When you multiply n*5 by the inverse of 5, you get n * 5*(inverse of 5), which in 32-bit math simplifies to n*1.
Now let's say n and q are 32-bit numbers, and q = n*(inverse of 5) mod 2^32.
Because n is no greater than 0xffffffff, we know that n/5 is no greater than (2^32-1)/5 (which is 0x33333333). Therefore, if q is less than or equal to (2^32-1)/5, then n is exactly divisible by 5: q * 5 doesn't get truncated in 32 bits and is therefore equal to n, so n = q * 5.
If q is greater than (2^32-1)/5, then n is not divisible by 5, because there is a one-to-one mapping between the 32-bit numbers divisible by 5 and the numbers between 0 and (2^32-1)/5, so any q outside this range cannot correspond to a number that is divisible by 5.
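The argument can be checked numerically with a small Python sketch. Python integers are unbounded, so the 32-bit truncation is modelled explicitly with a mask; the names `MASK32`, `LIMIT`, `divides5` and `divides10` are chosen for this illustration and mirror the C code only loosely.

```python
MASK32 = 0xFFFFFFFF          # models truncation to 32 bits
INV5 = 0xCCCCCCCD            # inverse of 5 modulo 2**32
LIMIT = MASK32 // 5          # 0x33333333, the largest possible n/5


def divides5(n):
    q = (n * INV5) & MASK32  # equals n/5 exactly when 5 divides n
    return q <= LIMIT


def divides10(n):
    return (n & 1) == 0 and divides5(n)


print((5 * INV5) & MASK32)   # 1
print(divides10(11223300))   # True
print(divides10(11223305))   # False
```

The first printed value confirms that 0xCCCCCCCD really is the inverse of 5 once the product is truncated to 32 bits.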
Here is Python code that checks divisibility by 10 by processing the binary digits one at a time and keeping only the running remainder modulo 10:
# read the input as a string holding a binary number, e.g. 1010 or 1110
s = input()
# the running remainder starts at 0
x = 0
for i in s:
    if i == '1':
        x = (x*2 + 1) % 10
    else:
        x = x*2 % 10
# if x ends up as 0, the number is divisible by 10
if x:
    print("Not divisible by 10")
else:
    print("Divisible by 10")

Zig Zag Decoding

In the Google Protocol Buffers encoding overview, they introduce something called "Zig Zag Encoding". It takes signed numbers that have a small magnitude and maps them to unsigned numbers that also have a small magnitude.
For example
Encoded => Plain
0 => 0
1 => -1
2 => 1
3 => -2
4 => 2
5 => -3
6 => 3
And so on. The encoding function they give for this is rather clever:
(n << 1) ^ (n >> 31) //for a 32 bit integer
I understand how this works; however, I cannot for the life of me figure out how to reverse this and decode it back into signed 32-bit integers.
Try this one:
(n >> 1) ^ (-(n & 1))
Edit:
I'm posting some sample code for verification:
#include <stdio.h>

int main()
{
    unsigned int n;
    int r;
    for(n = 0; n < 10; n++) {
        r = (n >> 1) ^ (-(n & 1));
        printf("%u => %d\n", n, r);
    }
    return 0;
}
I get following results:
0 => 0
1 => -1
2 => 1
3 => -2
4 => 2
5 => -3
6 => 3
7 => -4
8 => 4
9 => -5
Here's yet another way of doing the same, just for explanation purposes (you should obviously use 3lectrologos' one-liner).
You just have to notice that you xor with a number that is either all 1's (equivalent to bitwise not) or all 0's (equivalent to doing nothing). That's what (-(n & 1)) yields, or what is explained by google's "arithmetic shift" remark.
int zigzag_to_signed(unsigned int zigzag)
{
    int abs = (int) (zigzag >> 1);
    if (zigzag % 2)
        return ~abs;
    else
        return abs;
}

/* the parameter cannot be called "signed", which is a reserved keyword in C */
unsigned int signed_to_zigzag(int value)
{
    unsigned int abs = (unsigned int) value << 1;
    if (value < 0)
        return ~abs;
    else
        return abs;
}
So in order to have lots of 0's on the most significant positions, zigzag encoding uses the LSB as sign bit, and the other bits as the absolute value (only for positive integers actually, and absolute value -1 for negative numbers due to 2's complement representation).
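As a quick cross-check of the formulas discussed so far, here is a round-trip sketch in Python. Python integers are unbounded and `>>` on a negative value is an arithmetic shift, so the encoder masks its result to 32 bits to mimic the 32-bit case; `MASK32`, `zigzag_encode` and `zigzag_decode` are names chosen for this illustration.

```python
MASK32 = 0xFFFFFFFF


def zigzag_encode(n):
    """Signed 32-bit value -> unsigned zigzag code: (n << 1) ^ (n >> 31)."""
    return ((n << 1) ^ (n >> 31)) & MASK32


def zigzag_decode(z):
    """Unsigned zigzag code -> signed value: (z >> 1) ^ -(z & 1)."""
    return (z >> 1) ^ -(z & 1)


for z in range(7):
    print(z, "=>", zigzag_decode(z))
# 0 => 0
# 1 => -1
# 2 => 1
# 3 => -2
# 4 => 2
# 5 => -3
# 6 => 3
```

Decoding an encoded value returns the original for every signed 32-bit input, including the extremes -2**31 and 2**31 - 1.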
How about
(n>>1) - (n&1)*n
After fiddling with the accepted answer proposed by 3lectrologos, I couldn't get it to work when starting with unsigned longs (in C# -- compiler error). I came up with something similar instead:
( value >> 1 ) ^ ( ~( value & 1 ) + 1 )
This works great for any language that represents negative numbers in two's complement (e.g. .NET).
I have found a solution, unfortunately it's not the one line beauty I was hoping for:
uint signMask = u << 31;
int iSign = *((Int32*)&signMask);
iSign >>= 31;
signMask = *((UInt32*)&iSign);
UInt32 a = (u >> 1) ^ signMask;
return *((Int32*)&a);
I'm sure there are some super-efficient bitwise operations that do this faster, but the function is straightforward. Here's a Python implementation (note that, despite its name, it maps a signed value to its zigzag code, i.e. it works in the encoding direction):
def decode(n):
    if (n < 0):
        return (2 * abs(n)) - 1
    else:
        return 2 * n

>>> [decode(n) for n in [0,-1,1,-2,2,-3,3,-4,4]]
[0, 1, 2, 3, 4, 5, 6, 7, 8]