Conditional copying in CUDA, where data vector is longer than stencil - cuda

I would like to conditional copy data from vector, basing on stencil vector, which is N times shorter. Every element in stencil would be responsible for N elements in data vector.
Suppose that the vectors look as follows (N=3)
data = {1,2,3,4,5,6,7,8,9}
stencil = {1,0,1}
What I would like to get in result:
result = {1,2,3,7,8,9}
Is there a way to achieve this using functions from Thrust library?
I know, that there is:
thrust::copy_if (InputIterator1 first, InputIterator1 last, InputIterator2 stencil, OutputIterator result, Predicate pred)
but this doesn't allow me to copy N values from data vector basing on one element from stencil.

As is often the case, I imagine there are many possible ways to do this.
The approach which occurs to me (using copy_if) is to use the stencil vector as part of a thrust::permutation_iterator, that takes the stencil vector and generates the index into it using a thrust::transform_iterator. If we imagine a copying index that goes from 0..8 for this example, then we can index into the "source" (i.e. stencil) vector using a "map" index calculated using a thrust::counting_iterator with integer division by N (using thrust placeholders). The copying predicate just tests if the stencil value == 1.
The thrust quick start guide gives a concise description of how to use these fancy iterators.
Here is a worked example:
$ cat t471.cu
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <iostream>
using namespace thrust::placeholders;
int main(){
int data[] = {1,2,3,4,5,6,7,8,9};
int stencil[] = {1,0,1};
int ds = sizeof(data)/sizeof(data[0]);
int ss = sizeof(stencil)/sizeof(stencil[0]);
int N = ds/ss; // assume this whole number divisible
thrust::device_vector<int> d_data(data, data+ds);
thrust::device_vector<int> d_stencil(stencil, stencil+ss);
thrust::device_vector<int> d_result(ds);
int rs = thrust::copy_if(d_data.begin(), d_data.end(), thrust::make_permutation_iterator(d_stencil.begin(), thrust::make_transform_iterator(thrust::counting_iterator<int>(0), _1 / N)), d_result.begin(), _1 == 1) - d_result.begin();
thrust::copy_n(d_result.begin(), rs, std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl;
return 0;
}
$ nvcc -o t471 t471.cu
$ ./t471
1,2,3,7,8,9,
$
With the assumptions about stencil organization made here, we could also pre-compute the result size rs with thrust::reduce, and use that to allocate the result vector size:
$ cat t471.cu
#include <thrust/copy.h>
#include <thrust/reduce.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <iostream>
using namespace thrust::placeholders;
int main(){
int data[] = {1,2,3,4,5,6,7,8,9};
int stencil[] = {1,0,1};
int ds = sizeof(data)/sizeof(data[0]);
int ss = sizeof(stencil)/sizeof(stencil[0]);
int N = ds/ss; // assume this whole number divisible
thrust::device_vector<int> d_data(data, data+ds);
thrust::device_vector<int> d_stencil(stencil, stencil+ss);
int rs = thrust::reduce(d_stencil.begin(), d_stencil.end())*N;
thrust::device_vector<int> d_result(rs);
thrust::copy_if(d_data.begin(), d_data.end(), thrust::make_permutation_iterator(d_stencil.begin(), thrust::make_transform_iterator(thrust::counting_iterator<int>(0), _1 / N)), d_result.begin(), _1 == 1) - d_result.begin();
thrust::copy_n(d_result.begin(), rs, std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl;
return 0;
}
$ nvcc -o t471 t471.cu
$ ./t471
1,2,3,7,8,9,
$

Related

How to prevent thrust::copy_if illegal memory access

In a CUDA C++ code, I am using thrust::copy_if to copy those integer values that are not -1 from the array x to y:
thrust::copy_if(x_ptr, x_ptr + N, y_ptr, is_not_minus_one());
I put the code in a try/catch and I update the array x regularly, and again push numbers that are not -1 to the end of the y until it accessed to the out of the array and returns
CUDA Runtime Error: an illegal memory access was encountered
As I don't know how many values are not -1 in each iteration, I should keep generating and copying numbers in the main array until it becomes full. However, I wanna it to stop once it reaches the end of the array. How can I manage it?
A naive idea would be using another array and then copying if the number of new values does not exceed the size. but it might be inefficient. Any better idea?
Here is one possible method:
use remove_if() instead, (with the opposite condition - remove if -1) on x. Then use (x start and) the returned iterator to copy to your final array (y). Every time you do a copy to y, you will know how much space is left in y, and you can adjust the copy quantity if needed, before doing the copy.
Example (removing 1 values):
$ cat t2088.cu
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/complex.h>
#include <thrust/transform.h>
#include <thrust/copy.h>
#include <thrust/functional.h>
#include <iostream>
using mt = int;
using namespace thrust::placeholders;
int main(){
thrust::device_vector<mt> a(5);
thrust::device_vector<mt> c(25);
bool done = false;
int empty_spaces = c.size();
int copy_start_index = 0;
while (!done){
thrust::sequence(a.begin(), a.end());
int num_to_copy = (thrust::remove_if(a.begin(), a.end(), _1 == 1) - a.begin());
int actual_num_to_copy = std::min(num_to_copy, empty_spaces);
if (actual_num_to_copy != num_to_copy) done = true;
thrust::copy_n(a.begin(), actual_num_to_copy, c.begin()+copy_start_index);
copy_start_index += actual_num_to_copy;
empty_spaces -= actual_num_to_copy;
}
std::cout << "output array is full!" << std::endl;
thrust::host_vector<mt> h_c = c;
thrust::copy(h_c.begin(), h_c.end(), std::ostream_iterator<mt>(std::cout, ","));
std::cout << std::endl;
}
$ nvcc -o t2088 t2088.cu
$ compute-sanitizer ./t2088
========= COMPUTE-SANITIZER
output array is full!
0,2,3,4,0,2,3,4,0,2,3,4,0,2,3,4,0,2,3,4,0,2,3,4,0,
========= ERROR SUMMARY: 0 errors
$

How to order a vector using some predefined sequence in Thrust

I have a predefined sequence of elements like this in a vector, the vector contains thousands of elements :
207.1 226.1 229.1 231.1 210.1 239.1 235.1 201.1 247.1 245.1 197.1 203.1 246.1 249.1 196.1 248.1 244.1 238.1
In a different vector I have the same elements as in the predefined vector but in a scattered way like this
226.1 225.1 205.1 220.1 220.1 237.1 226.1 212.1 212.1 205.1 205.1 202.1 202.1 192.1 192.1 191.1 191.1 192.1 192.1 192.1
Now I want to club up the elements in the scattered vector so that the order of the predefined vector is maintained, so the result should be like this
207.1 207.1 207.1 226.1 226.1 226.1 226.1 229.1 229.1 229.1 229.1 . . .
Is there any way to do this using CUDA thrust?
I'll make a few assumptions that I think are necessary:
Your first "predefined" sequence has no duplicates. If there were duplicates, and they were not adjacent, I cannot come up with an ordering strategy
Your second "scattered" sequence does not have any elements which are not also in the first sequence. If there were, I would have no idea where to place these or how to order them
With those assumptions, here is one possible method, using the above definitions of "first" and "second" sequences:
For the first sequence, provide a vector of the same length (the "index" sequence) that indicates the index of the value:
207.1 226.1 229.1 231.1 ...
0 1 2 3 ...
Perform a sort_by_key to order the first sequence. The index sequence will now be scrambled.
Using the ordered first sequence, use thrust::lower_bound on the second sequence, to find which value of the first sequence it matches.
Using the matching value's index for each element in the second sequence (via thrust::permutation_iterator), sort_by_key the second sequence by matching value index.
Here is an example:
$ cat t41.cu
#include <thrust/sort.h>
#include <thrust/binary_search.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sequence.h>
#include <thrust/copy.h>
#include <thrust/iterator/permutation_iterator.h>
#include <iostream>
#include <cstdlib>
const int max_samples_per_element = 4;
typedef float dt;
typedef int it;
int main(){
// seq 1 and 2 data setup
// host
dt seq_1[] = {207.1, 226.1, 229.1, 231.1, 210.1, 239.1, 235.1, 201.1, 247.1, 245.1, 197.1, 203.1, 246.1, 249.1, 196.1, 248.1, 244.1, 238.1};
it seq_1_sz = sizeof(seq_1)/sizeof(seq_1[0]);
thrust::host_vector<dt> seq_2_hv;
for (int i = seq_1_sz-1; i >= 0; i--){
int element_samples = rand()%max_samples_per_element;
element_samples++;
for (int j = 0; j < element_samples; j++)
seq_2_hv.push_back(seq_1[i]);
}
// device
thrust::device_vector<dt> seq_2_dv = seq_2_hv;
thrust::device_vector<dt> seq_1_dv(seq_1, seq_1+seq_1_sz);
thrust::device_vector<it> index(seq_1_dv.size());
thrust::device_vector<it> index2(seq_2_dv.size());
thrust::sequence(index.begin(), index.end());
//process data
thrust::sort_by_key(seq_1_dv.begin(), seq_1_dv.end(), index.begin());
thrust::lower_bound(seq_1_dv.begin(), seq_1_dv.end(), seq_2_dv.begin(), seq_2_dv.end(), index2.begin());
auto my_pi = thrust::make_permutation_iterator(index.begin(), index2.begin());
thrust::sort_by_key(my_pi, my_pi+index2.size(), seq_2_dv.begin());
// display results
thrust::host_vector<dt> result = seq_2_dv;
thrust::copy_n(seq_1, seq_1_sz, std::ostream_iterator<dt>(std::cout, ","));
std::cout << std::endl;
thrust::copy(result.begin(), result.end(), std::ostream_iterator<dt>(std::cout, ","));
std::cout << std::endl;
}
$ nvcc -arch=sm_35 -o t41 t41.cu -O3 -lineinfo -Wno-deprecated-gpu-targets -std=c++14
$ ./t41
207.1,226.1,229.1,231.1,210.1,239.1,235.1,201.1,247.1,245.1,197.1,203.1,246.1,249.1,196.1,248.1,244.1,238.1,
207.1,207.1,207.1,226.1,229.1,229.1,229.1,231.1,231.1,231.1,231.1,210.1,210.1,210.1,210.1,239.1,239.1,239.1,235.1,235.1,235.1,235.1,201.1,201.1,201.1,247.1,247.1,245.1,245.1,197.1,203.1,203.1,203.1,246.1,246.1,246.1,246.1,249.1,249.1,196.1,196.1,196.1,196.1,248.1,248.1,244.1,244.1,244.1,238.1,238.1,238.1,238.1,
$
I'm not suggesting the above code is defect-free or suitable for any particular purpose. My objective here is to demonstrate a possible method, not provide a fully tested code.

Rank of each element in a matrix row using CUDA

Is there any way to find the rank of element in a matrix row separately using CUDA or any functions for the same provided by NVidia?
I don't know of a built-in ranking or argsort function in CUDA or any of the libraries I am familiar with.
You could certainly build such a function out of lower-level operations using thrust for example.
Here is a (non-optimized) outline of a possible solution approach using thrust:
$ cat t84.cu
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/sort.h>
#include <thrust/sequence.h>
#include <thrust/functional.h>
#include <thrust/adjacent_difference.h>
#include <thrust/transform.h>
#include <thrust/iterator/permutation_iterator.h>
#include <iostream>
typedef int mytype;
struct clamp
{
template <typename T>
__host__ __device__
T operator()(T data){
if (data == 0) return 0;
return 1;}
};
int main(){
mytype data[] = {4,1,7,1};
int dsize = sizeof(data)/sizeof(data[0]);
thrust::device_vector<mytype> d_data(data, data+dsize);
thrust::device_vector<int> d_idx(dsize);
thrust::device_vector<int> d_result(dsize);
thrust::sequence(d_idx.begin(), d_idx.end());
thrust::sort_by_key(d_data.begin(), d_data.end(), d_idx.begin(), thrust::less<mytype>());
thrust::device_vector<int> d_diff(dsize);
thrust::adjacent_difference(d_data.begin(), d_data.end(), d_diff.begin());
d_diff[0] = 0;
thrust::transform(d_diff.begin(), d_diff.end(), d_diff.begin(), clamp());
thrust::inclusive_scan(d_diff.begin(), d_diff.end(), d_diff.begin());
thrust::copy(d_diff.begin(), d_diff.end(), thrust::make_permutation_iterator(d_result.begin(), d_idx.begin()));
thrust::copy(d_result.begin(), d_result.end(), std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl;
}
$ nvcc -arch=sm_61 -o t84 t84.cu
$ ./t84
1,0,2,0,
$
If you are in CUDA, the concept rank is not the same as the one on other languages as openmp or mpi. On that case you will need to go on a global block of the code you need to work with threadIdx.x and blockIdx.x parameters

get reverse_iterator from device_ptr in CUDA

For a device_vector, I can use its rbegin() method to get its reverse iterator. But how to construct a reverse iterator directly from device_ptr?
May be this can be achieved by constructing a device_vector with the device_ptr, the code is as follows:
thrust::device_ptr<int> ptr = get_ptr();
thrust::device_vector<int> tmpVector(ptr , ptr + N)
thrust::inclusive_scan_by_key(tmpVector.rbegin(), tmpVector.rend(), ......);
But I don't know if thrust::device_vector<int> tmpVector(ptr , ptr + N) will construct a new vector and copy the data from ptr or it just reserve a reference from ptr? The documentation of Thrust doesn't mension this.
Any ideas?
Providing an answer based on the comment by Jared, to get this off the unanswered list, and to preserve the question for future readers.
To make a reverse iterator from any kind of iterator, including thrust::device_ptr, use the thrust::make_reverse_iterator function.
Here is a simple example:
$ cat t615.cu
#include <thrust/device_vector.h>
#include <thrust/iterator/reverse_iterator.h>
#include <thrust/device_ptr.h>
#include <thrust/sequence.h>
#include <thrust/copy.h>
#include <iostream>
#define DSIZE 4
int main(){
int *data;
cudaMalloc(&data, DSIZE*sizeof(int));
thrust::device_ptr<int> my_data = thrust::device_pointer_cast<int>(data);
thrust::sequence(my_data, my_data+DSIZE);
thrust::copy_n(my_data, DSIZE, std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl;
typedef thrust::device_vector<int>::iterator Iterator;
thrust::reverse_iterator<Iterator> r_iter = make_reverse_iterator(my_data+DSIZE); // note that we point the iterator to the "end" of the device pointer area
thrust::copy_n(r_iter, DSIZE, std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl;
return 0;
}
$ nvcc -arch=sm_35 -o t615 t615.cu
$ ./t615
0,1,2,3,
3,2,1,0,
$
The creation of a reverse iterator does not create any "extra array".

storing return value of thrust reduce_by_key on device vectors

I have been trying to use thrust function reduce_by_key on device vectors. In the documentation they have given example on host_vectors instead of any device vector. The main problem I am getting is in storing the return value of the function. To be more specific here is my code:
thrust::device_vector<int> hashValueVector(10)
thrust::device_vector<int> hashKeysVector(10)
thrust::device_vector<int> d_pathId(10);
thrust::device_vector<int> d_freqError(10); //EDITED
thrust::pair<thrust::detail::normal_iterator<thrust::device_ptr<int> >,thrust::detail::normal_iterator<thrust::device_ptr<int> > > new_end; //THE PROBLEM
new_end = thrust::reduce_by_key(hashKeysVector.begin(),hashKeysVector.end(),hashValueVector.begin(),d_pathId.begin(),d_freqError.begin());
I tried declaring them as device_ptr first in the definition since for host_vectors too they have used pointers in the definition in the documentation. But I am getting compilation error when I try that then I read the error statement and converted the declaration to the above, this is compiling fine but I am not sure whether this is the right way to define or not.
I there any other standard/clean way of declaring that (the "new_end" variable)? Please comment if my question is not clear somewhere.
EDIT: I have edited the declaration of d_freqError. It was supposed to be int I wrote it as hashElem by mistake, sorry for that.
There is a problem in the setup of your reduce_by_key operation:
new_end = thrust::reduce_by_key(hashKeysVector.begin(),hashKeysVector.end(),hashValueVector.begin(),d_pathId.begin(),d_freqError.begin());
Notice that in the documentation it states:
OutputIterator2 is a model of Output Iterator and and InputIterator2's value_type is convertible to OutputIterator2's value_type.
The value type of your InputIterator2 (i.e. hashValueVector.begin()) is int . The value type of your OutputIterator2 is struct hashElem. Thrust is not going to know how to convert an int to a struct hashElem.
Regarding your question, it should not be difficult to capture the return entity from reduce_by_key. According to the documentation it is a thrust pair of two iterators, and these iterators should be consistent with (i.e. of the same vector type and value type) as your keys iterator type and your values iterator type, respectively.
Here's an updated sample based on what you posted, which compiles cleanly:
$ cat t353.cu
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/pair.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <thrust/fill.h>
#include <thrust/copy.h>
typedef thrust::device_vector<int>::iterator dIter;
int main(){
thrust::device_vector<int> hashValueVector(10);
thrust::device_vector<int> hashKeysVector(10);
thrust::device_vector<int> d_pathId(10);
thrust::device_vector<int> d_freqError(10);
thrust::sequence(hashValueVector.begin(), hashValueVector.end());
thrust::fill(hashKeysVector.begin(), hashKeysVector.begin()+5, 1);
thrust::fill(hashKeysVector.begin()+6, hashKeysVector.begin()+10, 2);
thrust::pair<dIter, dIter> new_end;
new_end = thrust::reduce_by_key(hashKeysVector.begin(),hashKeysVector.end(),hashValueVector.begin(),d_pathId.begin(),d_freqError.begin());
std::cout << "Number of results are: " << new_end.first - d_pathId.begin() << std::endl;
thrust::copy(d_pathId.begin(), new_end.first, std::ostream_iterator<int>(std::cout, "\n"));
thrust::copy(d_freqError.begin(), new_end.second, std::ostream_iterator<int>(std::cout, "\n"));
}
$ nvcc -arch=sm_20 -o t353 t353.cu
$ ./t353
Number of results are: 3
1
0
2
10
5
30
$