I have two sets, A and B. The result (C) of my operation should contain the elements of A that are not in B. I use set_difference for this. However, the size of the result (C) has to be set before the operation; otherwise it has extra zeros at the end, as below:
A=
1 2 3 4 5 6 7 8 9 10
B=
1 2 8 11 7 4
C=
3 5 6 9 10 0 0 0 0 0
How can I set the size of the result (C) dynamically so that the output is C= 3 5 6 9 10? In a real problem, I would not know the required size of the result device_vector a priori.
My code:
#include <thrust/execution_policy.h>
#include <thrust/set_operations.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <iostream>
#include <iterator>
void remove_common_elements(thrust::device_vector<int> A, thrust::device_vector<int> B, thrust::device_vector<int>& C)
{
thrust::sort(thrust::device, A.begin(), A.end());
thrust::sort(thrust::device, B.begin(), B.end());
thrust::set_difference(thrust::device, A.begin(), A.end(), B.begin(), B.end(), C.begin());
}
int main(int argc, char * argv[])
{
thrust::device_vector<int> A(10);
thrust::sequence(thrust::device, A.begin(), A.end(),1); // x components of the 'A' vectors
thrust::device_vector<int> B(6);
B[0]=1;B[1]=2;B[2]=8;B[3]=11;B[4]=7;B[5]=4;
thrust::device_vector<int> C(A.size());
std::cout << "A="<< std::endl;
thrust::copy(A.begin(), A.end(), std::ostream_iterator<int>(std::cout, " "));
std::cout << std::endl;
std::cout << "B="<< std::endl;
thrust::copy(B.begin(), B.end(), std::ostream_iterator<int>(std::cout, " "));
std::cout << std::endl;
remove_common_elements(A, B, C);
std::cout << "C="<< std::endl;
thrust::copy(C.begin(), C.end(), std::ostream_iterator<int>(std::cout, " "));
std::cout << std::endl;
return 0;
}
In the general case (i.e. across various Thrust algorithms) there is often no way to know the output size in advance, only its upper bound. The usual approach is to pass a result vector sized to that upper bound. Thrust has no particular magic to solve this. After the operation you will know the size of the result, and it can be copied to a new vector if the "extra zeroes" are a problem for some reason (I can't think of a reason why they would be a problem generally, except that they use up allocated space).
If this is highly objectionable, one possibility (copying this information from a response by Jared Hoberock in another forum) is to run the algorithm twice, the first time using a discard_iterator (for the output data) and the second time with a real iterator, pointing to an actual vector allocation, of the requisite size. During the first pass, the discard_iterator is used to count the size of the actual result data, even though it is not stored anywhere. Quoting directly from Jared:
In the first phase, pass a discard_iterator as the output iterator. You can compare the discard_iterator returned as the result to compute the size of the output. In the second phase, call the algorithm "for real" and output into an array sized using the result of the first phase.
The technique is demonstrated in the set_operations.cu example [0,1]:
[0] https://github.com/thrust/thrust/blob/master/examples/set_operations.cu#L25
[1] https://github.com/thrust/thrust/blob/master/examples/set_operations.cu#L127
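A host-side sketch of the two-pass idea, using std::set_difference on std::vector rather than Thrust on the device. CountingSink is a hypothetical stand-in for thrust::discard_iterator (it is not a Thrust or standard-library type), and set_difference_exact is an illustrative name. The first pass only counts the elements that would be written; the second pass writes into a vector of exactly that size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical output iterator: discards every value written through it and
// counts how many times it was advanced (one advance per written element).
struct CountingSink {
    using iterator_category = std::output_iterator_tag;
    using value_type = void;
    using difference_type = std::ptrdiff_t;
    using pointer = void;
    using reference = void;

    std::size_t count = 0;

    CountingSink& operator*() { return *this; }
    CountingSink& operator++() { ++count; return *this; }
    CountingSink operator++(int) { CountingSink old = *this; ++count; return old; }

    // Assigning an element through the iterator is a no-op.
    template <typename T>
    CountingSink& operator=(const T&) { return *this; }
};

// Returns A \ B in a vector whose size is exactly the number of results.
std::vector<int> set_difference_exact(std::vector<int> a, std::vector<int> b) {
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());

    // Pass 1: count the output without storing it.
    CountingSink counted = std::set_difference(a.begin(), a.end(),
                                               b.begin(), b.end(),
                                               CountingSink{});

    // Pass 2: run the algorithm "for real" into exactly sized storage.
    std::vector<int> c(counted.count);
    auto c_end = std::set_difference(a.begin(), a.end(),
                                     b.begin(), b.end(), c.begin());
    assert(c_end == c.end());
    (void)c_end;  // silence unused-variable warnings when asserts are disabled
    return c;
}
```

With the A and B from the question this returns 3 5 6 9 10. The trade-off is running the set operation twice in exchange for never allocating more than the result needs.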
thrust::set_difference returns an iterator to the end of the resulting range.
If you just want to change the logical size of C to the number of resulting elements, you could simply erase the range "behind" the result range.
void remove_common_elements(thrust::device_vector<int> A,
thrust::device_vector<int> B, thrust::device_vector<int>& C)
{
thrust::sort(thrust::device, A.begin(), A.end());
thrust::sort(thrust::device, B.begin(), B.end());
auto C_end = thrust::set_difference(thrust::device, A.begin(), A.end(), B.begin(), B.end(), C.begin());
C.erase(C_end, C.end());
}
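The same pattern can be sketched on the host with std::vector, assuming the standard-library algorithms as stand-ins for their Thrust counterparts. This variant also resizes C to its upper bound (the size of A) inside the function, which is an addition of this sketch and not part of the answer's code, so the caller does not have to pre-size C.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Host-side analogue of remove_common_elements: C receives A \ B and is
// trimmed to the number of elements actually written.
void remove_common_elements(std::vector<int> a, std::vector<int> b,
                            std::vector<int>& c) {
    std::sort(a.begin(), a.end());
    std::sort(b.begin(), b.end());

    // Upper bound: the difference can never hold more elements than A.
    c.resize(a.size());

    // set_difference returns an iterator to the end of the written range.
    auto c_end = std::set_difference(a.begin(), a.end(),
                                     b.begin(), b.end(), c.begin());

    // Drop the unused tail so that c.size() is the real result size.
    c.erase(c_end, c.end());
    assert(c.size() <= a.size());
}
```

Erasing the tail changes only the logical size; whether the unused capacity is returned to the allocator is a separate matter.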
I run into this issue over and over in CUDA. I have, for a set of elements, done some GPU calculation. This results in some value that has linear meaning (for instance, in terms of memory):
element_sizes = [ 10, 100, 23, 45 ]
And now, for the next stage of GPU calculation, I need the following values:
memory_size = sum(element_sizes)
memory_offsets = [ 0, 10, 110, 133 ]
I can calculate memory_size at 80 gbps on my GPU using the reduction code available from NVIDIA. However, I can't use this code, as it uses a branching technique that does not compose the memory offsets array. I have tried many things, but what I have found is that simply copying element_sizes over to the host and calculating the offsets with a simd for loop is the simplest and fastest way to go:
// in pseudo code
host_element_sizes = copy_to_host(element_sizes);
host_offsets = (... *) malloc(...);
int total_size = 0;
for(int i = 0; i < ...; ...){
host_offsets[i] = total_size;
total_size += host_element_sizes[i];
}
device_offsets = (... *) device_malloc(...);
device_offsets = copy_to_device(host_offsets,...);
However, I have done this many times now, and it is starting to become a bottleneck. This seems like a typical problem, but I have found no work-around.
What is the expected way for a CUDA programmer to solve this problem?
I think the algorithm you are looking for is a prefix sum. A prefix sum on a vector produces another vector which contains the cumulative sum values of the input vector. A prefix sum exists in at least two variants - an exclusive scan or an inclusive scan. Conceptually these are similar.
If your element_sizes vector already resides in GPU global memory (which appears to be the case based on your pseudocode), there are library functions that run on the GPU and produce the memory_offsets vector directly. The memory_size value can then be obtained trivially: with an inclusive scan it is the last element of the result; with an exclusive scan it is the last element of the result plus the last element of element_sizes.
Here's a trivial worked example using thrust:
$ cat t319.cu
#include <thrust/scan.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/copy.h>
#include <iostream>
#include <iterator>
int main(){
const int element_sizes[] = { 10, 100, 23, 45 };
const int ds = sizeof(element_sizes)/sizeof(element_sizes[0]);
thrust::device_vector<int> dv_es(element_sizes, element_sizes+ds);
thrust::device_vector<int> dv_mo(ds);
thrust::exclusive_scan(dv_es.begin(), dv_es.end(), dv_mo.begin());
std::cout << "element_sizes:" << std::endl;
thrust::copy_n(dv_es.begin(), ds, std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl << "memory_offsets:" << std::endl;
thrust::copy_n(dv_mo.begin(), ds, std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl << "memory_size:" << std::endl << dv_es[ds-1] + dv_mo[ds-1] << std::endl;
}
$ nvcc -o t319 t319.cu
$ ./t319
element_sizes:
10,100,23,45,
memory_offsets:
0,10,110,133,
memory_size:
178
$
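To make the inclusive/exclusive distinction concrete, here is a host-side sketch using the C++17 std::exclusive_scan and std::inclusive_scan, which follow the same semantics as the Thrust functions of the same name. Layout, layout_from_sizes and total_from_inclusive_scan are illustrative names, not part of any library.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Offsets of each element plus the total size, as needed by the next stage.
struct Layout {
    std::vector<int> offsets;
    int total;
};

// Exclusive scan: offsets[i] is the sum of all sizes before element i.
Layout layout_from_sizes(const std::vector<int>& sizes) {
    Layout out{std::vector<int>(sizes.size()), 0};
    std::exclusive_scan(sizes.begin(), sizes.end(), out.offsets.begin(), 0);
    // The first offset of an exclusive scan is always the initial value.
    assert(out.offsets.empty() || out.offsets.front() == 0);
    if (!sizes.empty()) {
        // Total = last offset + last size.
        out.total = out.offsets.back() + sizes.back();
    }
    return out;
}

// Inclusive scan: the last element of the result is already the total.
int total_from_inclusive_scan(const std::vector<int>& sizes) {
    std::vector<int> running(sizes.size());
    std::inclusive_scan(sizes.begin(), sizes.end(), running.begin());
    return running.empty() ? 0 : running.back();
}
```

For element_sizes = { 10, 100, 23, 45 } both routes give a total of 178, and the exclusive scan gives the offsets 0, 10, 110, 133.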
I'm experiencing odd behavior while using the thrust::reverse function on a zip_iterator constructed with a thrust::make_zip_iterator( thrust::make_tuple( )) type syntax (see the answer from JackOLantern here for a good example of that combination).
I wish to reverse some arbitrarily-indicated section of multiple device vectors as in the example code below. When I do the reversing in one go by tupling and zipping them together, unexpected behavior ensues. The first half of the range is correctly changed to an inversion of the second half of the range, however, the second half of the range is left unchanged.
I've been using other thrust functions in a similar fashion (sort_by_key, unique_by_key, adjacent_difference, etc.) without issue. Am I just executing this incorrectly, or is there some reason that this will not work on a fundamental level? A thought I had is that perhaps the zip_iterator is not bidirectional as required for reverse. Is this true? I couldn't find documentation indicating this.
A workaround is just to reverse the vectors individually, which works as shown below. However, I suspect this will be less efficient. Note that in my actual use-case I have vectors with sizes of the order of 10,000 and I'm zipping up anywhere from 3-7 vectors for the operations.
#include <iostream>
#include <ostream>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/tuple.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/sequence.h>
#include <thrust/reverse.h>
int main(){
// initial host vectors
const int N=10;
thrust::host_vector<int> h1(N);
thrust::host_vector<float> h2(N);
// fill them
thrust::sequence( h1.begin(), h1.end(), 0);
thrust::sequence( h2.begin(), h2.end(), 10., 0.5);
// print initial contents
for (size_t i=0; i<N; i++){
std::cout << h1[i] << " " << h2[i] << std::endl;
}
// transfer to device
thrust::device_vector<int> d1 = h1;
thrust::device_vector<float> d2 = h2;
// what chunk to invert
int iStart = 3; int iEnd = 8;
// attempt to reverse middle via zip_iterators
thrust::reverse(
thrust::make_zip_iterator( thrust::make_tuple( d1.begin()+iStart, d2.begin()+iStart)),
thrust::make_zip_iterator( thrust::make_tuple( d1.begin()+iEnd, d2.begin()+iEnd))
);
// pull back and write out unexpected ordering
thrust::host_vector<int> temp1 = d1;
thrust::host_vector<float> temp2 = d2;
std::cout << "<==========>" << std::endl;
for (size_t i=0; i<N; i++){
std::cout << temp1[i] << " " << temp2[i] << std::endl;
}
// reset device variables
d1 = h1;
d2 = h2;
// reverse individually
thrust::reverse( d1.begin()+iStart, d1.begin()+iEnd);
thrust::reverse( d2.begin()+iStart, d2.begin()+iEnd);
// pull back and write out the desired ordering
temp1 = d1;
temp2 = d2;
std::cout << "<==========>" << std::endl;
for (size_t i=0; i<N; i++){
std::cout << temp1[i] << " " << temp2[i] << std::endl;
}
return 0;
}
Output
0 10
1 10.5
2 11
3 11.5
4 12
5 12.5
6 13
7 13.5
8 14
9 14.5
<==========>
0 10
1 10.5
2 11
7 13.5
6 13
5 12.5
6 13
7 13.5
8 14
9 14.5
<==========>
0 10
1 10.5
2 11
7 13.5
6 13
5 12.5
4 12
3 11.5
8 14
9 14.5
The information from Robert Crovella in the comments combined with the initially given workaround in the initial post appears to answer the question - thus, I will combine them here so the question can be marked as "answered." If others wish to post other solutions, I'm more than willing to look at them and move the "official answer" check mark. That being said...
The solution to the question has two parts:
If using an older version of CUDA and upgrading is an option: upgrade to the newest CUDA version and the operation should work (tested to work on CUDA 9.2.148 - thanks Robert!)
If unable to upgrade to a newer version of CUDA: apply reverse to the vectors individually to achieve the same result as given in the initial post. The code with only the working solution is copied below for completeness.
#include <iostream>
#include <ostream>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/tuple.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/sequence.h>
#include <thrust/reverse.h>
int main(){
// initial host vectors
const int N=10;
thrust::host_vector<int> h1(N);
thrust::host_vector<float> h2(N);
// fill them
thrust::sequence( h1.begin(), h1.end(), 0);
thrust::sequence( h2.begin(), h2.end(), 10., 0.5);
// print initial contents
for (size_t i=0; i<N; i++){
std::cout << h1[i] << " " << h2[i] << std::endl;
}
// transfer to device
thrust::device_vector<int> d1 = h1;
thrust::device_vector<float> d2 = h2;
// what chunk to invert
int iStart = 3; int iEnd = 8;
// reverse individually
thrust::reverse( d1.begin()+iStart, d1.begin()+iEnd);
thrust::reverse( d2.begin()+iStart, d2.begin()+iEnd);
// pull back and write out the desired ordering
thrust::host_vector<int> temp1 = d1;
thrust::host_vector<float> temp2 = d2;
std::cout << "<==========>" << std::endl;
for (size_t i=0; i<N; i++){
std::cout << temp1[i] << " " << temp2[i] << std::endl;
}
return 0;
}
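When several vectors (3-7 in the question) need the same range reversed, the per-vector workaround can be wrapped in one helper. The following is a host-side sketch with std::vector and std::reverse; reverse_range_in_all is a hypothetical name, and the same shape should carry over to thrust::reverse on device vectors, although that is an assumption not tested here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Reverses the half-open index range [first, last) in every vector passed in.
// Each vector is reversed on its own, so no zip iterator is involved.
template <typename... Vectors>
void reverse_range_in_all(std::size_t first, std::size_t last,
                          Vectors&... vectors) {
    // The range must be valid for every vector.
    assert(((first <= last && last <= vectors.size()) && ...));
    // C++17 fold expression: one std::reverse call per vector.
    (std::reverse(vectors.begin() + first, vectors.begin() + last), ...);
}
```

Called as reverse_range_in_all(3, 8, v1, v2) on the host vectors from the question, this produces the "desired ordering" shown in the last block of the output above.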
Let us say that I have a single array which stores time stamps for multiple events. For example, T1_e1, T2_e1,....,T1_e2, T2_e2, T3_e2,.....T1_eN, T2_eN,..
I know that Thrust offers a function which computes adjacent differences, but here I need to do it for multiple events. Basically, constructing multiple histograms from a single input array.
So the output would have N different histograms (one for each event) like this:
histogram bins for e1, histogram bins for e2, histogram bins for e3,....histogram bins for eN.
Input1 (timestamps): 100, 101, 104, 105, 101,104, 106, 111, 90, 91, 93, 94,95
Input2 (events): 4123,4123,4123,4123,2129,2129,2129,2129,300,300,300,300,300
output: 4123:(1,2),(2,0),(3,1),(4,0),(5,0)
2129:(1,0),(2,1),(3,1),(4,0),(5,1)
300: (1,2),(2,1),(3,0),(4,),(5,0)
The number of bins will be fixed, i.e. 5 bins per histogram.
Regarding the tuples: (x,y) -> x is the difference between two consecutive time stamps belonging to the same event. y is the count.
If we consider event 4123, the first tuple is (1,2), because the difference between 101 and 100 is 1, and 105 and 104 is 1. So there are two time stamp differences which belong to this bin, hence (1,2).
Can someone please suggest the most efficient way to do this? So far, it seems that I will have to write my own code. But if there are existing solutions, I would like to try them first.
Here's one possible approach that computes the sparse histogram (i.e. non-zero bins only). It should not be difficult to convert a sparse histogram to a dense histogram (zero and non-zero bins included) using thrust scattering.
1. Compute the timestamp differences using thrust::adjacent_difference and a special functor (my_adj_diff) that computes adjacent differences only for like events.
2. Use thrust::remove_if to remove the zero values (one per event) at the start of each event sequence (created by the functor in step 1).
3. Combine events and timestamp differences into a single integer for histogramming using thrust::transform. This just multiplies the event by 100 in my example, and adds the timestamp difference (assumed to be a set of bins of less than 100 max bin).
4. Use the sparse histogram method from the thrust histogram example code.
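As a sanity check of the logic, the steps above can be sketched on the host with a plain loop and a std::map in place of the Thrust primitives. sparse_histogram_by_event is an illustrative name, and the default scale of 100 mirrors the multiplier mentioned above. One difference is deliberate: this sketch skips the first timestamp of each event directly, so a genuine difference of zero between two timestamps of the same event would still be counted, whereas removing all zeros would drop it.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Returns packed key -> count, where key = event * scale + timestamp difference.
// Assumes equal events are stored contiguously and differences lie in [0, scale).
std::map<int, int> sparse_histogram_by_event(const std::vector<int>& stamps,
                                             const std::vector<int>& events,
                                             int scale = 100) {
    assert(stamps.size() == events.size());
    std::map<int, int> counts;
    for (std::size_t i = 1; i < stamps.size(); ++i) {
        // The first timestamp of an event has no predecessor within that event.
        if (events[i] != events[i - 1]) continue;
        const int diff = stamps[i] - stamps[i - 1];
        assert(diff >= 0 && diff < scale);
        // Pack event and difference into one key and count it.
        ++counts[events[i] * scale + diff];
    }
    return counts;
}
```

For the sample input from the question this gives the keys 30001, 30002, 212902, 212903, 212905, 412301, 412303 with counts 3, 1, 1, 1, 1, 2, 1.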
Here's a fully worked example. The data does not quite match your expected output (non-zero values only) because you have an error in your expected output.
$ cat t678.cu
#include <thrust/adjacent_difference.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/copy.h>
#include <thrust/remove.h>
#include <thrust/sort.h>
#include <thrust/inner_product.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <thrust/transform.h>
#include <iostream>
#include <iterator>
#define ZIP(X,Y) thrust::make_zip_iterator(thrust::make_tuple(X,Y))
#define SCALE 100
struct my_adj_diff
{
template <typename T>
__host__ __device__
T operator()(T &d2, T &d1) const
{
if (thrust::get<1>(d1) == thrust::get<1>(d2)) {
thrust::get<0>(d2) -= thrust::get<0>(d1);}
else {
thrust::get<0>(d2) = 0;}
return d2;
}
};
struct my_is_zero
{
template <typename T>
__host__ __device__
bool operator()(const T &d1) const
{
return (thrust::get<0>(d1) == 0);
}
};
struct my_combine
{
template <typename T>
__host__ __device__
T operator()(const T &d1, const T &d2) const
{
return (d1*SCALE)+d2;
}
};
// sparse histogram using reduce_by_key
// modified from: https://github.com/thrust/thrust/blob/master/examples/histogram.cu
template <typename Vector1,
typename Vector2,
typename Vector3>
void sparse_histogram(const Vector1& input,
Vector2& histogram_values,
Vector3& histogram_counts)
{
typedef typename Vector1::value_type ValueType; // input value type
typedef typename Vector3::value_type IndexType; // histogram index type
thrust::device_vector<ValueType> data(input);
// sort data to bring equal elements together
thrust::sort(data.begin(), data.end());
// number of histogram bins is equal to number of unique values (assumes data.size() > 0)
IndexType num_bins = thrust::inner_product(data.begin(), data.end() - 1,
data.begin() + 1,
IndexType(1),
thrust::plus<IndexType>(),
thrust::not_equal_to<ValueType>());
// resize histogram storage
histogram_values.resize(num_bins);
histogram_counts.resize(num_bins);
// count the occurrences of each unique value (one histogram bin per value)
thrust::reduce_by_key(data.begin(), data.end(),
thrust::constant_iterator<IndexType>(1),
histogram_values.begin(),
histogram_counts.begin());
}
int main(){
int tstamps[] = { 100, 101, 104, 105, 101,104, 106, 111, 90, 91, 93, 94,95 };
int mevents[] = {4123,4123,4123,4123,2129,2129,2129,2129,300,300,300,300,300};
int dsize = sizeof(tstamps)/sizeof(int);
thrust::host_vector<int> h_stamps(tstamps, tstamps+dsize);
thrust::host_vector<int> h_events(mevents, mevents+dsize);
thrust::device_vector<int> d_stamps = h_stamps;
thrust::device_vector<int> d_events = h_events;
thrust::device_vector<int> diffs(dsize);
// compute timestamp differences by event
thrust::adjacent_difference(ZIP(d_stamps.begin(), d_events.begin()), ZIP(d_stamps.end(), d_events.end()), ZIP(d_stamps.begin(), d_events.begin()), my_adj_diff());
d_stamps[0] = 0; // fix up first event for adjacent_difference
int sz1 = thrust::remove_if(ZIP(d_stamps.begin(), d_events.begin()), ZIP(d_stamps.end(), d_events.end()), my_is_zero()) - ZIP(d_stamps.begin(), d_events.begin());
d_stamps.resize(sz1);
d_events.resize(sz1);
// pack events and timestamps into a single vector - assumes max bin (time difference) is less than SCALE
thrust::device_vector<int> d_data(sz1);
thrust::transform(d_events.begin(), d_events.end(), d_stamps.begin(), d_data.begin(), my_combine());
// compute histogram
thrust::device_vector<int> histogram_values;
thrust::device_vector<int> histogram_counts;
sparse_histogram(d_data, histogram_values, histogram_counts);
thrust::copy(histogram_values.begin(), histogram_values.end(), std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl;
thrust::copy(histogram_counts.begin(), histogram_counts.end(), std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl;
}
$ nvcc t678.cu -o t678
$ ./t678
30001,30002,212902,212903,212905,412301,412303,
3,1,1,1,1,2,1,
$
Note that you will need to use CUDA 7.0 (not CUDA 7.0 RC or any earlier version of CUDA) or else download the latest thrust master branch from github, because older versions of thrust have an issue when attempting to use zip iterators with thrust::adjacent_difference.
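The conversion from the sparse output to a dense, fixed-size histogram mentioned at the start of this answer might look like the following on the host. This is a sketch under assumptions: dense_bins_for_event is a hypothetical name, the packed keys use the same SCALE of 100, and bin k (1-based) holds the count for a timestamp difference of k, with 5 bins per event as stated in the question. Differences outside 1..num_bins are dropped.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Expands sparse (packed key, count) pairs into a dense histogram for one event.
std::vector<int> dense_bins_for_event(const std::vector<int>& values,
                                      const std::vector<int>& counts,
                                      int event, int num_bins = 5,
                                      int scale = 100) {
    assert(values.size() == counts.size());
    std::vector<int> bins(num_bins, 0);  // bins with no entry stay zero
    for (std::size_t i = 0; i < values.size(); ++i) {
        const int key_event = values[i] / scale;  // unpack the event id
        const int diff = values[i] % scale;       // unpack the difference
        if (key_event == event && diff >= 1 && diff <= num_bins) {
            bins[diff - 1] = counts[i];           // scatter into its bin
        }
    }
    return bins;
}
```

Applied to the program output above, event 4123 becomes 2, 0, 1, 0, 0; event 2129 becomes 0, 1, 1, 0, 1; and event 300 becomes 3, 1, 0, 0, 0.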