How to prevent thrust::copy_if illegal memory access - cuda

In a CUDA C++ code, I am using thrust::copy_if to copy the integer values that are not -1 from the array x to y:
thrust::copy_if(x_ptr, x_ptr + N, y_ptr, is_not_minus_one());
I put the call in a try/catch. I update the array x regularly and keep pushing the numbers that are not -1 onto the end of y, until the copy runs off the end of the array and the program fails with
CUDA Runtime Error: an illegal memory access was encountered
As I don't know how many values are not -1 in each iteration, I have to keep generating and copying numbers into the main array until it becomes full. However, I want it to stop once it reaches the end of the array. How can I manage that?
A naive idea would be to use another array, and only copy if the number of new values does not exceed the remaining space; but that might be inefficient. Any better idea?

Here is one possible method:
Use remove_if() instead (with the opposite condition: remove if equal to -1) on x. Then use the start of x and the returned iterator to copy into your final array (y). Every time you copy to y, you know how much space is left in y, so you can clamp the copy quantity if needed, before doing the copy.
Example (removing values equal to 1):
$ cat t2088.cu
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sequence.h>
#include <thrust/remove.h>
#include <thrust/copy.h>
#include <thrust/functional.h>
#include <iostream>
#include <iterator>
#include <algorithm>

using mt = int;
using namespace thrust::placeholders;

int main(){
  thrust::device_vector<mt> a(5);
  thrust::device_vector<mt> c(25);
  bool done = false;
  int empty_spaces = c.size();
  int copy_start_index = 0;
  while (!done){
    thrust::sequence(a.begin(), a.end()); // regenerate the source data: 0,1,2,3,4
    // compact in place; the returned iterator tells us how many elements survived
    int num_to_copy = thrust::remove_if(a.begin(), a.end(), _1 == 1) - a.begin();
    int actual_num_to_copy = std::min(num_to_copy, empty_spaces); // clamp to remaining space
    if (actual_num_to_copy != num_to_copy) done = true; // output array will be full
    thrust::copy_n(a.begin(), actual_num_to_copy, c.begin()+copy_start_index);
    copy_start_index += actual_num_to_copy;
    empty_spaces -= actual_num_to_copy;
  }
  std::cout << "output array is full!" << std::endl;
  thrust::host_vector<mt> h_c = c;
  thrust::copy(h_c.begin(), h_c.end(), std::ostream_iterator<mt>(std::cout, ","));
  std::cout << std::endl;
}
$ nvcc -o t2088 t2088.cu
$ compute-sanitizer ./t2088
========= COMPUTE-SANITIZER
output array is full!
0,2,3,4,0,2,3,4,0,2,3,4,0,2,3,4,0,2,3,4,0,2,3,4,0,
========= ERROR SUMMARY: 0 errors
$
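The same clamping bookkeeping can be sketched host-side with the C++ standard library. This is a CPU analogue for illustration only, not the device version; the function name fill_fixed and the hard-coded batch are made up for the sketch:

```cpp
#include <algorithm>
#include <vector>

// Repeatedly generate a batch, compact out the -1 sentinels (analogue of
// thrust::remove_if), and clamp each copy to the space remaining in the
// fixed-size output (analogue of the thrust::copy_n step above).
std::vector<int> fill_fixed(std::size_t out_size){
    std::vector<int> y(out_size);
    std::size_t space_left = out_size, write_pos = 0;
    bool done = false;
    while (!done){
        std::vector<int> x = {3, -1, 5, -1, 7};             // freshly generated batch
        auto new_end = std::remove(x.begin(), x.end(), -1); // compact in place
        std::size_t produced = new_end - x.begin();
        std::size_t to_copy = std::min(produced, space_left); // clamp to what fits
        if (to_copy != produced) done = true;                 // output is now full
        std::copy_n(x.begin(), to_copy, y.begin() + write_pos);
        write_pos += to_copy;
        space_left -= to_copy;
    }
    return y;
}
```

Because the copy quantity is clamped before the copy happens, the write can never run past the end of the output, which is exactly what eliminates the illegal memory access in the device version.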

Related

How to order a vector using some predefined sequence in Thrust

I have a predefined sequence of elements like this in a vector, the vector contains thousands of elements :
207.1 226.1 229.1 231.1 210.1 239.1 235.1 201.1 247.1 245.1 197.1 203.1 246.1 249.1 196.1 248.1 244.1 238.1
In a different vector I have the same elements as in the predefined vector but in a scattered way like this
226.1 225.1 205.1 220.1 220.1 237.1 226.1 212.1 212.1 205.1 205.1 202.1 202.1 192.1 192.1 191.1 191.1 192.1 192.1 192.1
Now I want to club up the elements in the scattered vector so that the order of the predefined vector is maintained, so the result should be like this
207.1 207.1 207.1 226.1 226.1 226.1 226.1 229.1 229.1 229.1 229.1 . . .
Is there any way to do this using CUDA thrust?
I'll make a few assumptions that I think are necessary:
Your first "predefined" sequence has no duplicates. If there were duplicates, and they were not adjacent, I cannot come up with an ordering strategy
Your second "scattered" sequence does not have any elements which are not also in the first sequence. If there were, I would have no idea where to place these or how to order them
With those assumptions, here is one possible method, using the above definitions of "first" and "second" sequences:
For the first sequence, provide a vector of the same length (the "index" sequence) that indicates the index of the value:
207.1 226.1 229.1 231.1 ...
0 1 2 3 ...
Perform a sort_by_key to order the first sequence. The index sequence will now be scrambled.
Using the ordered first sequence, use thrust::lower_bound on the second sequence, to find which value of the first sequence it matches.
Using the matching value's index for each element in the second sequence (via thrust::permutation_iterator), sort_by_key the second sequence by matching value index.
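The four steps above can be sketched on the CPU with the standard library (illustration only; the thrust example that follows does the same work on the device, and the function name order_by_predefined is made up for this sketch):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Order `second` so that it follows the order given by `first`.
// Assumes: no duplicates in `first`; every element of `second` appears in `first`.
std::vector<double> order_by_predefined(const std::vector<double>& first,
                                        std::vector<double> second){
    // steps 1+2: pair each value with its predefined index, then sort by value
    // (the sort_by_key analogue: the index rides along with the key)
    std::vector<std::pair<double,int>> keyed;
    for (int i = 0; i < (int)first.size(); i++) keyed.push_back({first[i], i});
    std::sort(keyed.begin(), keyed.end());
    // step 3: lower_bound into the sorted keys recovers the predefined index
    auto rank = [&](double v){
        return std::lower_bound(keyed.begin(), keyed.end(),
                                std::make_pair(v, 0))->second;
    };
    // step 4: sort `second` by that recovered index (second sort_by_key analogue)
    std::stable_sort(second.begin(), second.end(),
                     [&](double a, double b){ return rank(a) < rank(b); });
    return second;
}
```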
Here is an example:
$ cat t41.cu
#include <thrust/sort.h>
#include <thrust/binary_search.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/sequence.h>
#include <thrust/copy.h>
#include <thrust/iterator/permutation_iterator.h>
#include <iostream>
#include <iterator>
#include <cstdlib>

const int max_samples_per_element = 4;
typedef float dt;
typedef int it;

int main(){
  // seq 1 and 2 data setup
  // host
  dt seq_1[] = {207.1, 226.1, 229.1, 231.1, 210.1, 239.1, 235.1, 201.1, 247.1, 245.1, 197.1, 203.1, 246.1, 249.1, 196.1, 248.1, 244.1, 238.1};
  it seq_1_sz = sizeof(seq_1)/sizeof(seq_1[0]);
  thrust::host_vector<dt> seq_2_hv;
  for (int i = seq_1_sz-1; i >= 0; i--){
    int element_samples = rand()%max_samples_per_element + 1; // 1..4 copies of each
    for (int j = 0; j < element_samples; j++)
      seq_2_hv.push_back(seq_1[i]);
  }
  // device
  thrust::device_vector<dt> seq_2_dv = seq_2_hv;
  thrust::device_vector<dt> seq_1_dv(seq_1, seq_1+seq_1_sz);
  thrust::device_vector<it> index(seq_1_dv.size());
  thrust::device_vector<it> index2(seq_2_dv.size());
  thrust::sequence(index.begin(), index.end());
  // process data
  thrust::sort_by_key(seq_1_dv.begin(), seq_1_dv.end(), index.begin());
  thrust::lower_bound(seq_1_dv.begin(), seq_1_dv.end(), seq_2_dv.begin(), seq_2_dv.end(), index2.begin());
  auto my_pi = thrust::make_permutation_iterator(index.begin(), index2.begin());
  thrust::sort_by_key(my_pi, my_pi+index2.size(), seq_2_dv.begin());
  // display results
  thrust::host_vector<dt> result = seq_2_dv;
  thrust::copy_n(seq_1, seq_1_sz, std::ostream_iterator<dt>(std::cout, ","));
  std::cout << std::endl;
  thrust::copy(result.begin(), result.end(), std::ostream_iterator<dt>(std::cout, ","));
  std::cout << std::endl;
}
$ nvcc -arch=sm_35 -o t41 t41.cu -O3 -lineinfo -Wno-deprecated-gpu-targets -std=c++14
$ ./t41
207.1,226.1,229.1,231.1,210.1,239.1,235.1,201.1,247.1,245.1,197.1,203.1,246.1,249.1,196.1,248.1,244.1,238.1,
207.1,207.1,207.1,226.1,229.1,229.1,229.1,231.1,231.1,231.1,231.1,210.1,210.1,210.1,210.1,239.1,239.1,239.1,235.1,235.1,235.1,235.1,201.1,201.1,201.1,247.1,247.1,245.1,245.1,197.1,203.1,203.1,203.1,246.1,246.1,246.1,246.1,249.1,249.1,196.1,196.1,196.1,196.1,248.1,248.1,244.1,244.1,244.1,238.1,238.1,238.1,238.1,
$
I'm not suggesting the above code is defect-free or suitable for any particular purpose. My objective here is to demonstrate a possible method, not provide a fully tested code.

Conditional copying in CUDA, where data vector is longer than stencil

I would like to conditionally copy data from a vector, based on a stencil vector which is N times shorter. Every element in the stencil would be responsible for N elements in the data vector.
Suppose that the vectors look as follows (N=3)
data = {1,2,3,4,5,6,7,8,9}
stencil = {1,0,1}
What I would like to get in result:
result = {1,2,3,7,8,9}
Is there a way to achieve this using functions from Thrust library?
I know, that there is:
thrust::copy_if (InputIterator1 first, InputIterator1 last, InputIterator2 stencil, OutputIterator result, Predicate pred)
but this doesn't allow me to copy N values from the data vector based on one element of the stencil.
As is often the case, I imagine there are many possible ways to do this.
The approach that occurs to me (using copy_if) is to build the stencil argument as a thrust::permutation_iterator over the stencil vector, generating the index into it with a thrust::transform_iterator. If we imagine a copying index that runs from 0..8 for this example, then the "map" index into the stencil vector is computed from a thrust::counting_iterator with integer division by N (using thrust placeholders). The copy predicate simply tests whether the stencil value == 1.
The thrust quick start guide gives a concise description of how to use these fancy iterators.
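The net effect of that iterator arithmetic can be stated in one line: data element i is kept iff stencil[i / N] == 1. A plain C++ sketch of that rule (illustration only; stencil_copy is a made-up name, and the thrust version below does this without a host loop):

```cpp
#include <vector>

// Keep data[i] whenever the owning stencil slot, stencil[i / N], is 1.
// This is exactly what the permutation_iterator + transform_iterator pair
// feeds to copy_if as its stencil argument.
std::vector<int> stencil_copy(const std::vector<int>& data,
                              const std::vector<int>& stencil, int N){
    std::vector<int> result;
    for (std::size_t i = 0; i < data.size(); i++)
        if (stencil[i / N] == 1)   // the "map" index: i / N
            result.push_back(data[i]);
    return result;
}
```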
Here is a worked example:
$ cat t471.cu
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <iostream>
#include <iterator>

using namespace thrust::placeholders;

int main(){
  int data[] = {1,2,3,4,5,6,7,8,9};
  int stencil[] = {1,0,1};
  int ds = sizeof(data)/sizeof(data[0]);
  int ss = sizeof(stencil)/sizeof(stencil[0]);
  int N = ds/ss; // assume ds is evenly divisible by ss
  thrust::device_vector<int> d_data(data, data+ds);
  thrust::device_vector<int> d_stencil(stencil, stencil+ss);
  thrust::device_vector<int> d_result(ds);
  // stencil index for data element i is i/N
  int rs = thrust::copy_if(d_data.begin(), d_data.end(),
             thrust::make_permutation_iterator(d_stencil.begin(),
               thrust::make_transform_iterator(thrust::counting_iterator<int>(0), _1 / N)),
             d_result.begin(), _1 == 1) - d_result.begin();
  thrust::copy_n(d_result.begin(), rs, std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;
  return 0;
}
$ nvcc -o t471 t471.cu
$ ./t471
1,2,3,7,8,9,
$
With the assumptions about stencil organization made here, we could also pre-compute the result size rs with thrust::reduce, and use that to allocate the result vector size:
$ cat t471.cu
#include <thrust/copy.h>
#include <thrust/reduce.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <iostream>
#include <iterator>

using namespace thrust::placeholders;

int main(){
  int data[] = {1,2,3,4,5,6,7,8,9};
  int stencil[] = {1,0,1};
  int ds = sizeof(data)/sizeof(data[0]);
  int ss = sizeof(stencil)/sizeof(stencil[0]);
  int N = ds/ss; // assume ds is evenly divisible by ss
  thrust::device_vector<int> d_data(data, data+ds);
  thrust::device_vector<int> d_stencil(stencil, stencil+ss);
  // each 1 in the stencil keeps N data elements
  int rs = thrust::reduce(d_stencil.begin(), d_stencil.end())*N;
  thrust::device_vector<int> d_result(rs);
  thrust::copy_if(d_data.begin(), d_data.end(),
    thrust::make_permutation_iterator(d_stencil.begin(),
      thrust::make_transform_iterator(thrust::counting_iterator<int>(0), _1 / N)),
    d_result.begin(), _1 == 1);
  thrust::copy_n(d_result.begin(), rs, std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl;
  return 0;
}
$ nvcc -o t471 t471.cu
$ ./t471
1,2,3,7,8,9,
$

How do you get the stream associated with a thrust execution policy?

I want to be able to get the stream-id which is associated with an execution policy in thrust. I am trying to access this function.
I have tried this :
cudaStream_t stream = 0;
auto policy = thrust::cuda::par.on(stream);
cudaStream_t str = stream(policy);
but I am getting a compilation error :
stream.cu(7): error: expression preceding parentheses of apparent call must have (pointer-to-) function type
Could I get some ideas on how to do this?
"I am trying to access this function." Trying to directly use things in the detail namespace is unwise: they are part of the implementation and may change from one version to the next. To wit: the file you are referring to does not even exist in the thrust distributed with CUDA 10.
However, this seems to work for me:
$ cat t354.cu
#include <thrust/execution_policy.h>
#include <iostream>
#include <cstring>

int main(){
  cudaStream_t mystream;
  cudaStreamCreate(&mystream);
  auto policy = thrust::cuda::par.on(mystream);
  cudaStream_t str = stream(policy);
  // compare the raw bytes of the two stream handles
  if (std::memcmp(&mystream, &str, sizeof(cudaStream_t)) != 0){
    std::cout << "mismatch" << std::endl;
    return -1;
  }
  std::cout << "match" << std::endl;
}
$ nvcc -std=c++11 -o t354 t354.cu
$ cuda-memcheck ./t354
========= CUDA-MEMCHECK
match
========= ERROR SUMMARY: 0 errors
$

CUDA: method to calculate all partial sums during a sum reduction

I run into this issue over and over in CUDA. I have, for a set of elements, done some GPU calculation. This results in some value that has linear meaning (for instance, in terms of memory):
element_sizes = [ 10, 100, 23, 45 ]
And now, for the next stage of GPU calculation, I need the following values:
memory_size = sum(element_sizes)
memory_offsets = [ 0, 10, 110, 133 ]
I can calculate memory_size at 80 GB/s on my GPU using the reduction code available from NVIDIA. However, I can't use this code, as it uses a branching technique that does not produce the memory offsets array. I have tried many things, but what I have found is that simply copying element_sizes to the host and calculating the offsets with a simple for loop is the simplest, fastest way to go:
// in pseudo code
host_element_sizes = copy_to_host(element_sizes);
host_offsets = (... *) malloc(...);
int total_size = 0;
for(int i = 0; i < ...; ...){
    host_offsets[i] = total_size;
    total_size += host_element_sizes[i];
}
device_offsets = (... *) device_malloc(...);
device_offsets = copy_to_device(host_offsets,...);
However, I have done this many times now, and it is starting to become a bottleneck. This seems like a typical problem, but I have found no work-around.
What is the expected way for a CUDA programmer to solve this problem?
I think the algorithm you are looking for is a prefix sum. A prefix sum on a vector produces another vector which contains the cumulative sum values of the input vector. A prefix sum exists in at least two variants - an exclusive scan or an inclusive scan. Conceptually these are similar.
If your element_sizes vector has been deposited in GPU global memory (it appears to be the case based on your pseudocode), then there exist library functions that run on the GPU that you could call at that point, to produce the memory_offsets data (vector), and the memory_size value could be trivially obtained from the last value in the vector, with a slight variation based on whether you are doing an inclusive scan or exclusive scan.
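For intuition, here is the same computation on the host with the C++17 standard library (the thrust example that follows does the equivalent on the device; offsets_of is a made-up name for this sketch):

```cpp
#include <numeric>
#include <vector>

// Exclusive scan: memory_offsets[i] = sum of element_sizes[0..i-1].
// memory_size is then the last offset plus the last element size.
std::vector<int> offsets_of(const std::vector<int>& sizes){
    std::vector<int> offsets(sizes.size());
    std::exclusive_scan(sizes.begin(), sizes.end(), offsets.begin(), 0);
    return offsets;
}
```

An inclusive scan would instead put the running total through element i at position i, so the last entry would be memory_size directly and the offsets would be shifted by one.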
Here's a trivial worked example using thrust:
$ cat t319.cu
#include <thrust/scan.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/copy.h>
#include <iostream>
#include <iterator>

int main(){
  const int element_sizes[] = { 10, 100, 23, 45 };
  const int ds = sizeof(element_sizes)/sizeof(element_sizes[0]);
  thrust::device_vector<int> dv_es(element_sizes, element_sizes+ds);
  thrust::device_vector<int> dv_mo(ds);
  thrust::exclusive_scan(dv_es.begin(), dv_es.end(), dv_mo.begin());
  std::cout << "element_sizes:" << std::endl;
  thrust::copy_n(dv_es.begin(), ds, std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl << "memory_offsets:" << std::endl;
  thrust::copy_n(dv_mo.begin(), ds, std::ostream_iterator<int>(std::cout, ","));
  std::cout << std::endl << "memory_size:" << std::endl << dv_es[ds-1] + dv_mo[ds-1] << std::endl;
}
$ nvcc -o t319 t319.cu
$ ./t319
element_sizes:
10,100,23,45,
memory_offsets:
0,10,110,133,
memory_size:
178
$

storing return value of thrust reduce_by_key on device vectors

I have been trying to use thrust function reduce_by_key on device vectors. In the documentation they have given example on host_vectors instead of any device vector. The main problem I am getting is in storing the return value of the function. To be more specific here is my code:
thrust::device_vector<int> hashValueVector(10);
thrust::device_vector<int> hashKeysVector(10);
thrust::device_vector<int> d_pathId(10);
thrust::device_vector<int> d_freqError(10); //EDITED
thrust::pair<thrust::detail::normal_iterator<thrust::device_ptr<int> >,thrust::detail::normal_iterator<thrust::device_ptr<int> > > new_end; //THE PROBLEM
new_end = thrust::reduce_by_key(hashKeysVector.begin(),hashKeysVector.end(),hashValueVector.begin(),d_pathId.begin(),d_freqError.begin());
I tried declaring them as device_ptr first, since the documentation uses pointers in its host_vector example. But I got a compilation error when I tried that; after reading the error message I converted the declaration to the above. This compiles fine, but I am not sure whether it is the right way to declare it. Is there any other standard/clean way of declaring that (the "new_end" variable)? Please comment if my question is not clear somewhere.
EDIT: I have edited the declaration of d_freqError. It was supposed to be int I wrote it as hashElem by mistake, sorry for that.
There is a problem in the setup of your reduce_by_key operation:
new_end = thrust::reduce_by_key(hashKeysVector.begin(),hashKeysVector.end(),hashValueVector.begin(),d_pathId.begin(),d_freqError.begin());
Notice that in the documentation it states:
OutputIterator2 is a model of Output Iterator and InputIterator2's value_type is convertible to OutputIterator2's value_type.
The value type of your InputIterator2 (i.e. hashValueVector.begin()) is int. The value type of your OutputIterator2 (in your original posting) was struct hashElem. Thrust is not going to know how to convert an int to a struct hashElem.
Regarding your question, it should not be difficult to capture the return entity from reduce_by_key. According to the documentation it is a thrust pair of two iterators, and these iterators should be consistent (i.e. of the same vector type and value type) with your keys iterator type and your values iterator type, respectively.
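For reference, the semantics being relied on here (collapse each run of consecutive equal keys into one key plus the sum of its values, and report how many runs were written) can be sketched on the host. This is a CPU analogue for illustration only; reduce_by_key_cpu is a made-up name, and the pair of vectors it returns plays the role of the two output ranges whose lengths the thrust pair of iterators encodes:

```cpp
#include <utility>
#include <vector>

// CPU analogue of thrust::reduce_by_key with the default equality and plus.
std::pair<std::vector<int>, std::vector<int>>
reduce_by_key_cpu(const std::vector<int>& keys, const std::vector<int>& vals){
    std::vector<int> out_keys, out_vals;
    for (std::size_t i = 0; i < keys.size(); i++){
        if (out_keys.empty() || keys[i] != out_keys.back()){
            out_keys.push_back(keys[i]);   // start a new run
            out_vals.push_back(vals[i]);
        } else {
            out_vals.back() += vals[i];    // extend the current run
        }
    }
    return {out_keys, out_vals};           // common length = number of runs
}
```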
Here's an updated sample based on what you posted, which compiles cleanly:
$ cat t353.cu
#include <iostream>
#include <iterator>
#include <thrust/device_vector.h>
#include <thrust/pair.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <thrust/fill.h>
#include <thrust/copy.h>

typedef thrust::device_vector<int>::iterator dIter;

int main(){
  thrust::device_vector<int> hashValueVector(10);
  thrust::device_vector<int> hashKeysVector(10);
  thrust::device_vector<int> d_pathId(10);
  thrust::device_vector<int> d_freqError(10);
  thrust::sequence(hashValueVector.begin(), hashValueVector.end());
  thrust::fill(hashKeysVector.begin(), hashKeysVector.begin()+5, 1);
  thrust::fill(hashKeysVector.begin()+6, hashKeysVector.begin()+10, 2);
  thrust::pair<dIter, dIter> new_end;
  new_end = thrust::reduce_by_key(hashKeysVector.begin(), hashKeysVector.end(), hashValueVector.begin(), d_pathId.begin(), d_freqError.begin());
  std::cout << "Number of results are: " << new_end.first - d_pathId.begin() << std::endl;
  thrust::copy(d_pathId.begin(), new_end.first, std::ostream_iterator<int>(std::cout, "\n"));
  thrust::copy(d_freqError.begin(), new_end.second, std::ostream_iterator<int>(std::cout, "\n"));
}
$ nvcc -arch=sm_20 -o t353 t353.cu
$ ./t353
Number of results are: 3
1
0
2
10
5
30
$