CUDA Thrust: How to do a maximum-reduce operation with a mask? - cuda

I have a long vector of doubles x[]. I have another long vector of bools xMask[]. They have the same size. I would like to use Thrust to compute the maximum value of x[], but only for those elements where xMask[] is true. For example:
x = [1, 2, 3, 4, 5, 6, 7, 8]
xMask = [true, false, true, false, true, false, true, false]
The Maximum-Reduce of x[] with xMask[] is 7 (not 8, because that value of xMask[] is false).
Can I easily do this in Thrust?

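As of now, Thrust has no function named reduce_if, which is what you are looking for. There are several ways to do this with the existing functions, and which one is best for your problem will probably depend on the ratio of trues to falses in the mask and on how they are distributed.
That being said, the canonical way of achieving this is to use transform_reduce together with a zip_iterator. The example below sums the elements of an int vector whose mask entry is true; for a masked maximum of doubles, have the transform return the lowest representable double for masked-out elements and reduce with thrust::maximum<double> instead: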
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/transform_reduce.h>
#include <thrust/iterator/zip_iterator.h>
int reduce_if(thrust::device_vector<int> const &data,
thrust::device_vector<bool> const &mask) {
return thrust::transform_reduce(
thrust::make_zip_iterator(thrust::make_tuple(
data.cbegin(), mask.cbegin())),
thrust::make_zip_iterator(thrust::make_tuple(
data.cend(), mask.cend())),
// device-callable lambda: compile with nvcc --extended-lambda
[] __host__ __device__ (const thrust::tuple<int, bool> &elem){
return thrust::get<1>(elem) ? thrust::get<0>(elem) : 0;
},
0,
thrust::plus<int>{});
}
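For the masked maximum that the question asks about, the same transform-then-reduce shape applies with a different identity value and reduction operator. The following is only a host-side C++17 sketch of that logic, using std::transform_reduce as a stand-in for thrust::transform_reduce. The name masked_max is hypothetical, and returning the lowest double when no element is selected is an assumption, not something the answer specifies.

```cpp
#include <limits>
#include <numeric>
#include <vector>

// Hypothetical helper: maximum of x[i] over all i where mask[i] is true.
// Assumption: if no element is selected, the lowest finite double is returned.
double masked_max(const std::vector<double>& x, const std::vector<bool>& mask) {
    const double lowest = std::numeric_limits<double>::lowest();
    // The two-range overload walks x and mask in lockstep, which is the role
    // the zip_iterator plays in the Thrust version.
    return std::transform_reduce(
        x.begin(), x.end(), mask.begin(),
        lowest,                                              // identity for max
        [](double a, double b) { return a > b ? a : b; },    // reduction: max
        [lowest](double v, bool m) { return m ? v : lowest; });  // masking transform
}
```

With the vectors from the question, masked_max returns 7. In a Thrust version the two lambdas would presumably become a device-callable functor and thrust::maximum<double>.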

Related

Thrust: Stream compaction copying only first N valid elements

I have a const thrust vector of elements from which I would like to extract at most N elements that pass a predicate (in any order), where the thrust vector size and N are known at compile-time. In my specific case, my vector is 500k elements and N is 100k.
My initial thought was to use thrust::copy_if to get all elements that pass the predicate, then to use only the first N elements for my subsequent calculations. However, in that case I would have to allocate two vectors of 500k elements (one for the initial vector, and one for the output of copy_if) and I'd have to process every element.
As this is an operation I have to do many times and across several CUDA streams, I would like to know if there is a way to obtain the N output elements while minimizing the memory footprint required, and ideally, minimizing the number of elements that need to be processed (i.e. breaking the process once N valid elements have been found).
One possible method to perform a stream compaction operation is to perform a predicated prefix-sum followed by a conditional indexed copy. By breaking a "monolithic" operation into these 2 pieces, it becomes fairly easy to insert the desired limiting behavior on output size.
The prefix sum is a fairly involved operation. We will use thrust for that. The conditional indexed copy is fairly trivial, so we will write our own CUDA kernel for that, rather than try to wrestle with a thrust::copy_if operation to get the copy logic just right. This kernel is where we will insert the limiting behavior on the output size.
Here is a worked example:
$ cat t34.cu
#include <thrust/scan.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <iostream>
using namespace thrust::placeholders;
typedef int mt;
__global__ void my_copy(mt *d, int *i, mt *r, int limit, int size){
int idx = threadIdx.x+blockDim.x*blockIdx.x;
if (idx < size){
if ((idx == 0) && (*i == 1) && (limit > 0))
*r = *d;
else if ((idx > 0) && (i[idx] > i[idx-1]) && (i[idx] <= limit)){
r[i[idx]-1] = d[idx];}
}
}
int main(){
int rs = 3;
mt d[] = {0, 1, 0, 2, 0, 3, 0, 4, 0, 5};
int ds = sizeof(d)/sizeof(d[0]);
thrust::device_vector<mt> data(d, d+ds);
thrust::device_vector<int> idx(ds);
thrust::device_vector<mt> result(rs);
auto my_cmp = thrust::make_transform_iterator(data.begin(), 0+(_1>0));
thrust::inclusive_scan(my_cmp, my_cmp+ds, idx.begin());
my_copy<<<(ds+255)/256, 256>>>(thrust::raw_pointer_cast(data.data()), thrust::raw_pointer_cast(idx.data()), thrust::raw_pointer_cast(result.data()), rs, ds);
thrust::host_vector<mt> h_result = result;
thrust::copy_n(h_result.begin(), rs, std::ostream_iterator<mt>(std::cout, ","));
std::cout << std::endl;
}
$ nvcc -std=c++14 -o t34 t34.cu -arch=sm_52
$ ./t34
1,2,3,
$
(CUDA 11.0, Fedora 29, GTX 960)
Note that this code is provided for demonstration purposes. You should not assume that it is defect-free or suitable for any particular purpose. Use it at your own risk.
A bit of study with a profiler will show that the thrust::inclusive_scan operation does perform a cudaMalloc and cudaFree operation "under the hood". So even though we have pulled most of the allocations "out into the open" here, thrust apparently still needs to perform a single temporary allocation (of unknown size) to support the scan operation.
Responding to a question in the comments below. To understand this: 0+(_1>0), there are two things to note:
The general syntax is using thrust::placeholders. This capability of thrust allows us to write simple unary or binary functions inline, avoiding the need to use lambdas or write separate functors.
The reason for the 0+ is as follows. If we simply used (_1>0), then thrust would use as its unary function a boolean test of the item returned by dereferencing the iterator, compared to zero. The result of that comparison is a boolean, and if we leave it that way, the prefix sum will ultimately be computed using boolean arithmetic, which we do not want. We want the result of the boolean greater-than test (i.e. true/false) to be converted to an integer, so that the subsequent prefix sum gets performed using integer arithmetic. Prepending the (_1>0) boolean test with 0+ accomplishes that.
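The two-step idea above (a predicated prefix sum followed by a limited, conditional indexed copy) can be sketched on the host in plain C++. This is only an illustration of the logic, not a replacement for the CUDA code: compact_first_n is a hypothetical name, the predicate (value > 0) is taken from the worked example, and sizing the result to min(limit, number of matches) is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical host-side analogue of the scan + my_copy pipeline:
// keep at most `limit` elements of `data` that are > 0, in original order.
std::vector<int> compact_first_n(const std::vector<int>& data, std::size_t limit) {
    // Step 1: predicate flags stored as ints (not bools), so the scan counts.
    std::vector<int> idx(data.size());
    for (std::size_t k = 0; k < data.size(); ++k) idx[k] = data[k] > 0 ? 1 : 0;

    // Step 2: inclusive prefix sum. For an element that passed the predicate,
    // idx[k] is now its 1-based position in the output.
    std::partial_sum(idx.begin(), idx.end(), idx.begin());

    // Step 3: conditional indexed copy with the output-size limit applied.
    // Each iteration is independent, like one thread of the kernel.
    const std::size_t total = idx.empty() ? 0 : static_cast<std::size_t>(idx.back());
    std::vector<int> result(std::min(limit, total));
    for (std::size_t k = 0; k < data.size(); ++k) {
        const int prev = (k == 0) ? 0 : idx[k - 1];
        const bool passed = idx[k] > prev;  // the running count went up here
        if (passed && static_cast<std::size_t>(idx[k]) <= limit)
            result[idx[k] - 1] = data[k];
    }
    return result;
}
```

With the input from the worked example and a limit of 3, this yields 1, 2, 3. Note that, like the CUDA version, it still scans every element; only the copy is limited.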

Igraph calculating minimum spanning tree with weights C interface

I have been trying to calculate a minimum spanning tree using the Prim method, but I am confused about how weights are used in this context. The suggested example program in the source documentation does not appear to be correct, and I don't understand why the edge betweenness needs to be calculated.
Please see the following program, which is designed to make a simple undirected graph.
#include <igraph.h>
int main()
{
igraph_vector_t eb, edges;
igraph_vector_t weights;
long int i;
igraph_t theGraph, tree;
struct arg {
int index;
int source;
int target;
float weight;
};
struct arg data[] = {
{0, 0, 1, 2.0},
{1, 1, 2, 3.0},
{2, 2, 3, 44.0},
{3, 3, 4, 3.0},
{4, 4, 1, 2.0},
{5, 4, 5, 9.0},
{6, 4, 6, 3.0},
{7, 6, 5, 7.0}
};
int nargs = sizeof(data) / sizeof(struct arg);
igraph_empty(&theGraph, nargs, IGRAPH_UNDIRECTED);
igraph_vector_init(&weights, nargs);
// create graph
for (i = 0; i < nargs; i++) {
igraph_add_edge(&theGraph, data[i].source, data[i].target);
// Add an weight per entry
igraph_vector_set(&weights, i, data[i].weight);
}
igraph_vector_init(&eb, igraph_ecount(&theGraph));
igraph_edge_betweenness(&theGraph, &eb, IGRAPH_UNDIRECTED, &weights);
for (i = 0; i < igraph_vector_size(&eb); i++) {
VECTOR(eb)[i] = -VECTOR(eb)[i];
}
igraph_minimum_spanning_tree_prim(&theGraph, &tree, &eb);
igraph_write_graph_edgelist(&tree, stdout);
igraph_vector_init(&edges, 0);
igraph_minimum_spanning_tree(&theGraph, &edges, &eb);
igraph_vector_print(&edges);
igraph_vector_destroy(&edges);
igraph_destroy(&tree);
igraph_destroy(&theGraph);
igraph_vector_destroy(&eb);
return 0;
}
Can anybody see anything wrong with this program? It is designed to build a simple graph with what I hope is the correct way to use a weight argument: one value per edge between a source and a target.
The section that computes edge betweenness comes from the original code example for the use of Prim. It just needs to be removed for the program to work correctly with user-supplied weights, i.e. pass &weights to the spanning-tree calls in place of &eb.

When I use thrust::counting_iterator how can I select the backend for Thrust 1.7 (CUDA 5.5)?

Thrust automatically selects the GPU backend when I provide an algorithm with iterators from thrust::device_vector, since the vector's data lives on the GPU. However, when I only provide thrust::counting_iterator parameters to an algorithm, how can I select which backend it executes on?
In the following invocation of thrust::find, there are no device_vector iterator arguments, so how does Thrust choose which backend (CPU, OMP, TBB, CUDA) to use?
How can I control on which backend this algorithm executes without using thrust::device_vector<> in this code?
thrust::counting_iterator<uint64_t> first(i);
thrust::counting_iterator<uint64_t> last = first + step_size;
auto iter = thrust::find(
thrust::make_transform_iterator(first, functor),
thrust::make_transform_iterator(last, functor),
true);
UPDATE 23.01.14. MSVS2012, CUDA5.5, Thrust 1.7:
Compile success!
#include <iostream>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/find.h>
#include <thrust/functional.h>
#include <thrust/execution_policy.h>
struct is_odd : public thrust::unary_function<uint64_t, bool> {
__host__ __device__ bool operator()(uint64_t const& x) {
return x & 1;
}
};
int main() {
thrust::counting_iterator<uint64_t> first(0);
thrust::counting_iterator<uint64_t> last = first + 100;
auto iter = thrust::find(thrust::device,
thrust::make_transform_iterator(first, is_odd()),
thrust::make_transform_iterator(last, is_odd()),
true);
int bbb; std::cin >> bbb;
return 0;
}
Sometimes where a Thrust algorithm executes can be ambiguous, as in your counting_iterator example, because its associated "backend system" is thrust::any_system_tag (a counting_iterator can be dereferenced anywhere because it is not backed by data). In situations like this, Thrust will use the device backend. By default, this will be CUDA. However, you can explicitly control how execution happens in a couple of ways.
You can either explicitly specify the system through the template parameter as in ngimel's answer, or you can provide the thrust::device execution policy as the first argument to thrust::find in your example:
#include <thrust/execution_policy.h>
...
thrust::counting_iterator<uint64_t> first(i);
thrust::counting_iterator<uint64_t> last = first + step_size;
auto iter = thrust::find(thrust::device,
thrust::make_transform_iterator(first, functor),
thrust::make_transform_iterator(last, functor),
true);
This technique requires Thrust 1.7 or later.
You have to specify the System template parameter when instantiating counting_iterator:
typedef thrust::device_system_tag System;
thrust::counting_iterator<uint64_t,System> first(i);
If you are using a current version of Thrust, follow the approach Jared Hoberock described. If you may have to support older versions (the system you work on might have an old version of CUDA), the example below might help.
#include <thrust/version.h>
#if THRUST_MINOR_VERSION > 6
#include <thrust/execution_policy.h>
#elif THRUST_MINOR_VERSION == 6
#include <thrust/iterator/retag.h>
#else
#endif
...
#if THRUST_MINOR_VERSION > 6
total =
thrust::transform_reduce(
thrust::host
, thrust::counting_iterator<unsigned int>(0)
, thrust::counting_iterator<unsigned int>(N)
, AFunctor(), 0, thrust::plus<unsigned int>());
#elif THRUST_MINOR_VERSION == 6
total =
thrust::transform_reduce(
thrust::retag<thrust::host_system_tag>(thrust::counting_iterator<unsigned int>(0))
, thrust::retag<thrust::host_system_tag>(thrust::counting_iterator<unsigned int>(N))
, AFunctor(), 0, thrust::plus<unsigned int>());
#else
total =
thrust::transform_reduce(
thrust::counting_iterator<unsigned int, thrust::host_space_tag>(0)
, thrust::counting_iterator<unsigned int, thrust::host_space_tag>(N)
, AFunctor(), 0, thrust::plus<unsigned int>());
#endif
See also: Thrust: How to directly control where an algorithm invocation executes?

Eigen library: return a matrix block in a function as lvalue

I am trying to return a block of a matrix as an lvalue of a function. Let's say my function looks like this:
Block<Derived> getBlock(MatrixXd & m, int i, int j, int row, int column)
{
return m.block(i,j,row,column);
}
It seems the C++ compiler treats the result of block() as a temporary, so returning it as an lvalue is rejected. However, the Eigen documentation has an example that uses a block as an lvalue (http://eigen.tuxfamily.org/dox/TutorialBlockOperations.html#TutorialBlockOperationsUsing), so I am wondering why the same cannot be done with a function return value:
a.block(0,0,2,3) = a.block(2,1,2,3);
Thank you!
I want to put what I found myself so it might be helpful to someone else:
My basic solution is to know what derived type you want the block to be. In this case:
Block<MatrixXd> getBlock(MatrixXd & m, int i, int j, int row, int column)
{
return m.block(i,j,row,column);
}
It is interesting to me to notice that this method will return the reference to the content of matrix m by default. So if we do:
MatrixXd m = MatrixXd::Zero(10,10);
Block<MatrixXd> myBlock = getBlock(m, 1, 1, 3, 3);
myBlock << 1, 0, 0,
0, 1, 0,
0, 0, 1;
The content of matrix m will be modified as well. By contrast,
MatrixXd m = MatrixXd::Zero(10,10);
MatrixXd myBlock = getBlock(m, 1, 1, 3, 3);
myBlock << 1, 0, 0,
0, 1, 0,
0, 0, 1;
will leave m unchanged. My understanding is that converting the block to another type (here MatrixXd) makes Eigen copy the data, so myBlock is an independent matrix.
I was trying something like this, specifically returning the last 3 elements of a 4 element vector and I couldn't get this to work.
Solution turned out kind of nice, although maybe a little confusing if you're not familiar with trailing return types:
struct foo{
Eigen::Vector4d e_;
// const version
auto get_tail() const -> const auto { return e_.tail<3>(); };
// non-const version
auto get_tail() -> auto { return e_.tail<3>(); };
};

CUDA: Getting max value and its index in an array

I have several blocks, where each block operates on a separate part of an integer array. As an example: block one covers array[0] to array[9] and block two covers array[10] to array[19].
What is the best way I can get the index of the max value of the array for each block?
Example: for block one, a[0] to a[9] have the following values:
5 10 2 3 4 34 56 3 9 10
So 56 is the largest value at index 6.
I cannot use shared memory because the array may be very big and therefore won't fit. Are there any libraries that allow me to do this fast?
I know about the reduction algorithm, but I think my case is different because I want to get the index of the largest element.
If I understand correctly, you want the index of the maximum value in the array A.
If so, I would suggest using the Thrust library. Here is how you would do it:
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/tuple.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <iostream>
using namespace thrust;
// return the biggest of two tuples
template <class T>
struct bigger_tuple {
__device__ __host__
tuple<T,int> operator()(const tuple<T,int> &a, const tuple<T,int> &b)
{
if (a > b) return a;
else return b;
}
};
template <class T>
int max_index(device_vector<T>& vec) {
// create implicit index sequence [0, 1, 2, ... )
counting_iterator<int> begin(0); counting_iterator<int> end(vec.size());
tuple<T,int> init(vec[0],0);
// reduce (value, index) pairs; the pair with the largest value survives
tuple<T,int> largest;
largest = reduce(make_zip_iterator(make_tuple(vec.begin(), begin)), make_zip_iterator(make_tuple(vec.end(), end)),
init, bigger_tuple<T>());
return get<1>(largest);
}
int main(){
thrust::host_vector<int> h_vec(1024);
thrust::sequence(h_vec.begin(), h_vec.end()); // values = indices
// transfer data to the device
thrust::device_vector<int> d_vec = h_vec;
int index = max_index(d_vec);
std::cout << "Max index is:" << index <<std::endl;
std::cout << "Value is: " << h_vec[index] <<std::endl;
return 0;
}
This will not benefit the original poster, but for those who came to this page looking for an answer, I would second the recommendation to use Thrust, which already has a function, thrust::max_element, that does this: it returns an iterator to the largest element, and subtracting the begin iterator from it gives the index. min_element and minmax_element functions are also provided. See the Thrust documentation for details.
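The iterator-to-index step is easy to miss, so here is a minimal host-only sketch using std::max_element, whose interface thrust::max_element mirrors. The helper name argmax is hypothetical, and the sketch assumes a non-empty vector.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical helper: index of the largest element of a non-empty vector.
// std::max_element returns an iterator to the first largest element;
// the distance from begin() turns that iterator into an index.
std::size_t argmax(const std::vector<int>& v) {
    return static_cast<std::size_t>(
        std::distance(v.begin(), std::max_element(v.begin(), v.end())));
}
```

For the values in the question (5 10 2 3 4 34 56 3 9 10) this gives index 6. With a thrust::device_vector the same pattern would presumably read thrust::max_element(d.begin(), d.end()) - d.begin().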
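As well as the suggestion to use Thrust, you could also use the cuBLAS cublasIsamax function. Note that it operates on single-precision floats, finds the element of largest absolute value, and returns a 1-based index.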
The size of your array in comparison to shared memory is almost irrelevant, since the number of threads in each block is the limiting factor rather than the size of the array. One solution is to have each thread block work on a section of the array the same size as the thread block. That is, if you have 512 threads, then block n will be looking at array[ 512*n ] through array[ 512*n + 511 ]. Each block does a reduction to find the highest member in that portion of the array. Then you bring the max of each section back to the host and do a simple linear search to locate the highest value in the overall array. Each reduction on the GPU reduces the linear search by a factor of 512. Depending on the size of the array, you might want to do more reductions before you bring the data back. (If your array is 3*512^10 in size, you might want to do 10 reductions on the GPU, and have the host search through the 3 remaining data points.)
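The scheme described above (one reduction per fixed-size section, then a linear search over the per-section results) can be sketched on the host as follows. This only illustrates the data flow: the names ChunkMax, chunk_maxima and overall_max are hypothetical, the inner loop stands in for the per-block GPU reduction, and the sketch assumes a non-empty array and chunk > 0.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Value and global index of a section's maximum.
struct ChunkMax { int value; std::size_t index; };

// One reduction pass: section n covers [n*chunk, min((n+1)*chunk, size)).
// On the GPU each section would be handled by one thread block.
std::vector<ChunkMax> chunk_maxima(const std::vector<int>& a, std::size_t chunk) {
    std::vector<ChunkMax> out;
    for (std::size_t start = 0; start < a.size(); start += chunk) {
        const std::size_t stop = std::min(start + chunk, a.size());
        ChunkMax best{a[start], start};
        for (std::size_t i = start + 1; i < stop; ++i)
            if (a[i] > best.value) best = ChunkMax{a[i], i};
        out.push_back(best);
    }
    return out;
}

// Final host-side linear search over the per-section results.
ChunkMax overall_max(const std::vector<ChunkMax>& partial) {
    ChunkMax best = partial.front();
    for (const ChunkMax& c : partial)
        if (c.value > best.value) best = c;
    return best;
}
```

Carrying the global index alongside the value in each partial result is what lets the final search report where the maximum is, not just what it is.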
One thing to watch out for when doing a max value plus index reduction is that if there is more than one identical valued maximum element in your array, i.e. in your example if there were 2 or more values equal to 56, then the index which is returned would not be unique and possibly be different on every run of the code because the timing of the thread ordering over the GPU is not deterministic.
To get around this kind of problem you can use a unique ordering index such as threadid + threadsperblock * blockid, or else the element index location if that is unique. Then the max test is along these lines:
if (a > max_so_far || (a == max_so_far && order_a > order_max_so_far))
{
max_so_far = a;
index_max_so_far = index_a;
order_max_so_far = order_a;
}
(index and order can be the same variable, depending on the application.)
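To see why the extra ordering test makes the result reproducible, here is a small host-only sketch in which the element index doubles as the order. The names Candidate, better and reduce_in_order are hypothetical; the visit argument simulates threads combining elements in different orders, and the sketch assumes a non-empty visit list.

```cpp
#include <cstddef>
#include <vector>

struct Candidate { int value; std::size_t index; };

// Larger value wins; on equal values the larger index wins.
// Because this is a total order, the winner does not depend on visit order.
Candidate better(const Candidate& a, const Candidate& b) {
    if (a.value > b.value || (a.value == b.value && a.index > b.index)) return a;
    return b;
}

// Combine the elements of v in the order given by `visit`
// (a stand-in for a non-deterministic thread schedule).
Candidate reduce_in_order(const std::vector<int>& v,
                          const std::vector<std::size_t>& visit) {
    Candidate best{v[visit[0]], visit[0]};
    for (std::size_t k = 1; k < visit.size(); ++k)
        best = better(best, Candidate{v[visit[k]], visit[k]});
    return best;
}
```

For the array 5, 56, 2, 56, 4, every visit order returns index 3. Without the index comparison, either 1 or 3 could come back depending on the order.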