How to use thrust::copy_if using pointers [closed] - cuda

I am trying to copy the non-zero elements of an array to a different array using pointers. I tried the solution from the question "thrust copy_if: incomplete type is not allowed", but I get zeros in my resulting array. Here is my code:
This is the predicate functor:
struct is_not_zero
{
__host__ __device__
bool operator()( double x)
{
return (x != 0);
}
};
And this is where the copy_if function is used:
double out[5];
thrust::device_ptr<double> output = thrust::device_pointer_cast(out);
double *test1;
thrust::device_ptr<double> gauss_res(hostResults1);
thrust::copy_if(thrust::host,gauss_res, gauss_res+3,output, is_not_zero());
test1 = thrust::raw_pointer_cast(output);
for(int i =0;i<6;i++) {
cout << test1[i] << " the number " << endl;
}
where hostResults1 is the output array from a kernel.

You are making a variety of errors, as discussed in the comments, and you have not provided complete code, so it is not possible to state what all of them are. Generally speaking, you appear to be mixing up device and host activity, and device and host pointers. For example, in the code shown, out is a host stack array wrapped in a thrust::device_ptr, the thrust::host execution policy is applied to a range described by device pointers, and the print loop reads 6 elements from a 5-element array. Host and device data should generally be kept separate, and treated separately, in algorithms. The exception would be copying from device to host, but this can't be done with thrust::copy and raw pointers. You must use vector iterators or properly decorated thrust device pointers.
Here is a complete example based on what you have shown:
$ cat t66.cu
#include <thrust/copy.h>
#include <iostream>
#include <thrust/device_ptr.h>
struct is_not_zero
{
__host__ __device__
bool operator()( double x)
{
return (x != 0);
}
};
int main(){
const int ds = 5;
double *out, *hostResults1;
cudaMalloc(&out, ds*sizeof(double));
cudaMalloc(&hostResults1, ds*sizeof(double));
cudaMemset(out, 0, ds*sizeof(double));
double test1[ds];
for (int i = 0; i < ds; i++) test1[i] = 1;
test1[3] = 0;
cudaMemcpy(hostResults1, test1, ds*sizeof(double), cudaMemcpyHostToDevice);
thrust::device_ptr<double> output = thrust::device_pointer_cast(out);
thrust::device_ptr<double> gauss_res(hostResults1);
thrust::copy_if(gauss_res, gauss_res+ds,output, is_not_zero());
cudaMemcpy(test1, out, ds*sizeof(double), cudaMemcpyDeviceToHost);
for(int i =0;i<ds;i++) {
std::cout << test1[i] << " the number " << std::endl;
}
}
$ nvcc -o t66 t66.cu
$ ./t66
1 the number
1 the number
1 the number
1 the number
0 the number
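Note that the final 0 in this output is not a copied value: four elements pass the predicate, and the fifth slot of out still holds the zero written by cudaMemset. Like std::copy_if, thrust::copy_if returns an iterator one past the last element it wrote, so the number of valid results can be computed rather than assumed. The following is a minimal host-only sketch of that idea in plain C++ using the standard-library algorithm; the helper name copy_non_zero is hypothetical and is not part of the code above.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Same predicate as above, minus the CUDA decorations (host-only sketch).
struct is_not_zero
{
    bool operator()(double x) const { return x != 0; }
};

// Hypothetical helper: copies the non-zero elements of `in` into `out`
// and returns how many elements were actually written.
std::size_t copy_non_zero(const std::vector<double>& in, std::vector<double>& out)
{
    out.assign(in.size(), 0.0);  // worst case: every element passes the predicate
    std::vector<double>::iterator last =
        std::copy_if(in.begin(), in.end(), out.begin(), is_not_zero());
    return static_cast<std::size_t>(last - out.begin());
}
```

For the input {1, 1, 1, 0, 1} this returns 4, and only the first four entries of out are meaningful. In the Thrust version the same subtraction should work on the returned device pointer, i.e. the result of thrust::copy_if minus output.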

Related

Need help optimizing thrust cuda code with nested iterator transform_reduce operations

I am working on code I would like to execute efficiently on a GPU. Most of the code has been easy to vectorize and prepare for parallel execution. There are several nice examples on Stack Overflow that have helped me with the standard nested iterators. I have one section I have not been able to successfully condense into an efficient thrust construct. I have taken that section of my code and made a minimum reproducible example. Any advice or hint on how to structure this code would be appreciated.
Thanks
#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>
#include <ctime>
#include <thrust/reduce.h>
#include <thrust/device_vector.h>
typedef thrust::device_vector<double> tDoubleVecDevice;
typedef tDoubleVecDevice::iterator tDoubleVecDeviceIter;
struct functorB{
template <typename T>
__host__ __device__
double operator()(const T &my_tuple){ // do some math
return ( fmod((thrust::get<0>(my_tuple) * thrust::get<1>(my_tuple)),1.0) );
}
};
struct functorC {
template <typename T>
__host__ __device__
double operator()(const T &my_tuple){ // do some math
double distance = fabs( fmod((thrust::get<0>(my_tuple) - thrust::get<1>(my_tuple)),1.0));
return((fmin( distance, 1.0 - distance)) / (5.0));
}
};
int main(void)
{
tDoubleVecDevice resF(36);
tDoubleVecDevice freqI(36);
tDoubleVecDevice trialTs(128);
std::srand(std::time(nullptr));
for(tDoubleVecDeviceIter tIter = trialTs.begin();tIter < trialTs.end(); tIter++ ){
(*tIter) = rand() % 10 + 1.5; // make some random numbers
}
for(tDoubleVecDeviceIter rIter = resF.begin(), fIter = freqI.begin();fIter < resF.end(); rIter++ ,fIter++){
(*fIter) = rand() % 10 + 1.5; // make some random numbers
(*rIter) = rand() % 10 + 1.5; // make some random numbers
}
tDoubleVecDevice trialRs(36);
tDoubleVecDevice errorVect(128);
for( tDoubleVecDeviceIter itTrial = trialTs.begin(), itError = errorVect.begin(); itTrial != trialTs.end(); itTrial++,itError++){
thrust::transform( (thrust::make_zip_iterator(thrust::make_tuple(thrust::make_constant_iterator<double>(*itTrial), freqI.begin()))),
(thrust::make_zip_iterator(thrust::make_tuple(thrust::make_constant_iterator<double>(*itTrial)+36, freqI.end()))),
trialRs.begin() ,functorB());
(*itError) =thrust::transform_reduce(
thrust::make_zip_iterator(thrust::make_tuple(trialRs.begin(),resF.begin())),
thrust::make_zip_iterator(thrust::make_tuple(trialRs.end(),resF.end())),
functorC(),(double) 0,thrust::plus<double>()
);
}
// finds the index of the minimum element;
int minElementIndex = thrust::min_element(errorVect.begin(),errorVect.end()) - errorVect.begin();
double result = trialTs[minElementIndex];
std::cout << "result = " << result;
return 0;
}
It looks like you need to expand your trialTs, trialRs, errorVect, freqI and resF vectors to 4608 elements (128 trials times 36 frequencies). This will allow you to vectorize the loops. Derive a class from thrust::iterator_adaptor to make a cyclic iterator that expands freqI and resF into repeated sequences of the data in those vectors.
After you run your functors, use a reduce-by-key operation to create the error result for each 36-element trial.
Give that a try, and if you get stuck I will provide some additional code.
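To make the index arithmetic behind that suggestion concrete, here is a host-only sketch in plain C++ (not Thrust code). It walks the 128 * 36 = 4608 flat indices once: index i belongs to trial i / 36, which plays the role of the key in a reduce by key, and to frequency i % 36, which is what a cyclic iterator over freqI and resF would deliver. The helper names trial_r, wrapped_error and flat_errors are hypothetical; the math is taken from functorB and functorC in the question.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// functorB from the question: fractional part of trial * freq.
inline double trial_r(double trial, double freq)
{
    return std::fmod(trial * freq, 1.0);
}

// functorC from the question: wrapped distance between r and res, divided by 5.
inline double wrapped_error(double r, double res)
{
    double distance = std::fabs(std::fmod(r - res, 1.0));
    return std::fmin(distance, 1.0 - distance) / 5.0;
}

// Hypothetical helper: one pass over trialTs.size() * freqI.size() flat indices.
// Assumes freqI and resF have the same length, as in the question.
std::vector<double> flat_errors(const std::vector<double>& trialTs,
                                const std::vector<double>& freqI,
                                const std::vector<double>& resF)
{
    const std::size_t n = freqI.size();             // 36 in the question
    const std::size_t total = trialTs.size() * n;   // 128 * 36 = 4608 in the question
    std::vector<double> errorVect(trialTs.size(), 0.0);
    for (std::size_t i = 0; i < total; ++i) {
        const std::size_t key = i / n;  // which trial this element belongs to
        const std::size_t col = i % n;  // cyclic position inside freqI / resF
        errorVect[key] += wrapped_error(trial_r(trialTs[key], freqI[col]), resF[col]);
    }
    return errorVect;
}
```

Because every flat index is independent apart from the final per-key sum, the loop body maps onto a single transform over a counting iterator, and the accumulation onto one segmented reduction, instead of 128 separate transform and transform_reduce launches.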

Multiple occurrence subvector search with cuda Thrust

I want to find occurrences of a subvector in a device vector on the GPU, using the thrust library.
Say that for an array str = "aaaabaaab", I need to find the occurrences of substr = "ab".
How should I use the thrust::find function to search for a subvector?
In a nutshell, how should I implement a string search algorithm with the thrust library?
I would agree with the comments provided that thrust doesn't provide a single function that does this in "typical thrust fashion" and you would not want to use a sequence of thrust functions (e.g. a loop) as that would likely be quite inefficient.
A fairly simple CUDA kernel can be written that does this in a brute-force fashion.
For relatively simple CUDA kernels, we can realize something equivalent in thrust in an "un-thrust-like" fashion, by simply passing the CUDA kernel code as a functor to a thrust per-element operation such as thrust::transform or thrust::for_each.
Here is an example:
$ cat t462.cu
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>
struct my_f
{
char *array, *string;
size_t arr_len;
int str_len;
my_f(char *_array, size_t _arr_len, char *_string, int _str_len) :
array(_array), arr_len(_arr_len), string(_string), str_len(_str_len) {};
__host__ __device__
bool operator()(size_t idx){
for (int i=0; i < str_len; i++)
if ((i+idx)>= arr_len) return false;
else if (array[i+idx] != string[i]) return false;
return true;
}
};
int main(){
char data[] = "aaaabaaab";
char str[] = "ab";
size_t data_len = sizeof(data)-1;
int str_len = sizeof(str)-1;
thrust::device_vector<char> d_data(data, data+data_len);
thrust::device_vector<char> d_str(str, str+str_len);
thrust::device_vector<bool> result(data_len);
thrust::transform(thrust::counting_iterator<size_t>(0), thrust::counting_iterator<size_t>(data_len), result.begin(), my_f(thrust::raw_pointer_cast(d_data.data()), data_len, thrust::raw_pointer_cast(d_str.data()), str_len));
thrust::copy(result.begin(), result.end(), std::ostream_iterator<bool>(std::cout, ","));
std::cout << std::endl;
}
$ nvcc -o t462 t462.cu
$ ./t462
0,0,0,1,0,0,0,1,0,
$
Whether or not such a "brute-force" approach is efficient for this type of problem I don't know. Probably there are better/more efficient methods, especially when searching for occurrence of longer strings.
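The output above is a vector of flags, one per starting position. If the actual match positions are wanted, the flags can be compacted into indices. Here is a host-only sketch in plain C++ of the same brute-force test followed by that compaction; the names matches_at and find_occurrences are hypothetical. In Thrust, the compaction step would correspond to the stencil form of thrust::copy_if applied to a counting iterator, with result as the stencil.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Same test as my_f::operator(): does `pattern` occur in `text` starting at `idx`?
bool matches_at(const std::string& text, const std::string& pattern, std::size_t idx)
{
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        if (i + idx >= text.size()) return false;       // ran off the end of the text
        if (text[i + idx] != pattern[i]) return false;  // character mismatch
    }
    return true;
}

// Hypothetical helper: start positions of every occurrence of `pattern` in `text`.
std::vector<std::size_t> find_occurrences(const std::string& text, const std::string& pattern)
{
    std::vector<std::size_t> positions;
    for (std::size_t idx = 0; idx < text.size(); ++idx)
        if (matches_at(text, pattern, idx)) positions.push_back(idx);
    return positions;
}
```

For text "aaaabaaab" and pattern "ab" this yields the positions 3 and 7, which are the two indices flagged with 1 in the output above.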

C++ the returned array value is not correct

First off, thanks for giving me a hand with this. I am no expert in C++, but I have done some work in C. My problem is that my code does not display the returned array values correctly.
In general, my program is trying to evaluate a function F(x), display it in table format, and find its min and max. I have found ways of doing all that, but when I display the returned values of the array F(x), they are somehow distorted. The first value is always correct, for example
cout << *(value+0) <<endl;
but the following values are not the same as the expected f(x). Sorry in advance if my code is not up to the proper standard, but I have been wrapping my head around this for a while now.
My Full Code
#include <iostream>
#include <fstream>
#include <cmath>
#include <iomanip>
#include <string>
#include <stdlib.h>
using namespace std;
float *evaluate ();
void display ();
void Min_Max(float *);
int main()
{
float *p;
evaluate();
display();
cin.get();
p = evaluate();
Min_Max(p);
return 0;
}
float *evaluate()
{
ofstream Out_File("result.txt");
int n=30;
float x [n];
float fx[n];
float interval = ((4-(-2))/0.2);
x[0]= -2.0;
for(n=0;n <= interval;n++)
{
fx[n] = 4*exp((-x[n])/2)*sin((2*x[n]- 0.3)*3.14159/180);
x[n+1] = x[n] + 0.2;
if (Out_File.is_open())
{
Out_File <<setprecision(5)<<setw(8)<<showpoint<<fixed<< x[n];
Out_File << "\t\t"<<setprecision(5)<<setw(8)<<showpoint<<fixed<<fx[n]<<endl;
}
else cout << "Unable to open file";
}
Out_File.close();
return fx;
}
void display()
{
ifstream inFile;
inFile.open("result.txt");
string line;
cout << " x\t\t\t f(x)"<<endl;
cout << "_______________________________________"<<endl;
while( getline (inFile,line))
{
cout<<line<<endl;
}
inFile.close();
}
void Min_Max(float *value)
{
int a=0;
for(a=0;a<=30;a++){
cout << *(value+a) <<endl;
*value =0;}
}
You pass p to your function Min_Max, where p is a pointer to the first element of an array. That array is created as a local variable in another function, evaluate. That doesn't work: as soon as evaluate returns, all its local variables, including the fx array, are destroyed, and the pointer you return is left dangling. Reading through it is undefined behavior, which is why the values look distorted.
Instead, you can allocate fx in heap memory (with the new[] operator). But don't forget to free it afterward with delete[].
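An alternative that avoids manual memory management is to return a std::vector<float> by value, so the caller owns the storage. The sketch below makes some assumptions that are not in the question: it drops the file output, it uses 31 sample points for x = -2.0 to 4.0 in steps of 0.2 (the loop in the question also appears to run one element past the end of its 30-element arrays), and min_max is a hypothetical replacement for Min_Max that does not overwrite the data.

```cpp
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

// Evaluates f(x) = 4 * exp(-x / 2) * sin((2x - 0.3) * pi / 180)
// at x = -2.0, -1.8, ..., 4.0 and returns the values by value,
// so the storage outlives the function call.
std::vector<float> evaluate()
{
    const int count = 31;  // 30 intervals of 0.2 give 31 sample points
    std::vector<float> fx(count);
    for (int n = 0; n < count; ++n) {
        const float x = -2.0f + 0.2f * n;
        fx[n] = 4 * std::exp(-x / 2) * std::sin((2 * x - 0.3f) * 3.14159f / 180);
    }
    return fx;
}

// Returns {min, max} of the values without modifying them.
// Assumes `values` is not empty.
std::pair<float, float> min_max(const std::vector<float>& values)
{
    std::pair<std::vector<float>::const_iterator,
              std::vector<float>::const_iterator> mm =
        std::minmax_element(values.begin(), values.end());
    return std::make_pair(*mm.first, *mm.second);
}
```

With this shape, main can simply write std::vector<float> p = evaluate(); and pass p to min_max, and nothing has to be freed by hand.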

How to bring equal elements together using thrust without sort

I have an array of elements such that each element defines the "equal to" operator only.
In other words no ordering is defined for such type of element.
Since I can't use thrust::sort as in the thrust histogram example how can I bring equal elements together using thrust?
For example:
my array is initially
a e t b c a c e t a
where identical characters represent equal elements.
After the elaboration, the array should be
a a a t t b c c e e
but it can be also
a a a c c t t e e b
or any other permutation.
I would recommend that you follow an approach such as that laid out by @m.s. in the posted answer there. As I stated in the comments, ordering of elements is an extremely useful mechanism that aids in the reduction of complexity for problems like this.
However the question as posed asks if it is possible to group like elements without sorting. With an inherently parallel processor like a GPU, I spent some time thinking about how it might be accomplished without sorting.
If we have both a large number of objects, as well as a large number of unique object types, then I think it's possible to bring some level of parallelism to the problem, however my approach outlined here will still have atrocious, scattered memory access patterns. For the case where there are only a small number of distinct or unique object types, the algorithm I am discussing here has little to commend it. This is just one possible approach. There may well be other, far better approaches:
1. The starting point is to develop a set of "linked lists" that indicate the matching neighbor to the left and the matching neighbor to the right, for each element. This is accomplished via my search_functor and thrust::for_each, on the entire data set. This step is reasonably parallel and also has reasonable memory access efficiency for large data sets, but it does require a worst-case traversal of the entire data set from start to finish (a side-effect, I would call it, of not being able to use ordering; we must compare every element to other elements until we find a match). The generation of two linked lists allows us to avoid all-to-all comparisons.
2. Once we have the lists (right-neighbor and left-neighbor) built in step 1, it's an easy matter to count the number of unique objects, using thrust::count.
3. We then get the starting indexes of each unique element (i.e. the leftmost index of each type of unique element, in the dataset), using thrust::copy_if stream compaction.
4. The next step is to count the number of instances of each of the unique elements. This step is doing list traversal, one thread per element list. If I have a small number of unique elements, this will not effectively utilize the GPU. In addition, the list traversal will result in lousy access patterns.
5. After we have counted the number of each type of object, we can then build a sequence of starting indices for each object type in the output list, via thrust::exclusive_scan on the numbers of each type of object.
6. Finally, we can copy each input element to its appropriate place in the output list. Since we have no way to group or order the elements yet, we must again resort to list traversal. Once again, this will be inefficient use of the GPU if the number of unique object types is small, and will also have lousy memory access patterns.
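As a way to check the results of the parallel version, the count, scan and scatter idea described above can be written as a short sequential reference in plain C++. This is only a host-side sketch, not the GPU algorithm: it finds each element's group with a linear search instead of neighbor lists, the name group_equal is hypothetical, and it assumes the element type is default-constructible. Like the approach above, it relies on operator== only.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sequential reference: groups equal elements using only operator==.
template <typename T>
std::vector<T> group_equal(const std::vector<T>& data)
{
    std::vector<std::size_t> first_idx;          // index of the first occurrence of each unique value
    std::vector<std::size_t> type(data.size());  // unique-value id assigned to every element
    std::vector<std::size_t> counts;             // number of instances of each unique value

    // Identify the unique values (in order of first appearance) and count their instances.
    for (std::size_t i = 0; i < data.size(); ++i) {
        std::size_t t = 0;
        while (t < first_idx.size() && !(data[first_idx[t]] == data[i])) ++t;
        if (t == first_idx.size()) { first_idx.push_back(i); counts.push_back(0); }
        type[i] = t;
        ++counts[t];
    }

    // Exclusive scan of the counts gives the start offset of each group in the output.
    std::vector<std::size_t> start(counts.size());
    std::size_t running = 0;
    for (std::size_t t = 0; t < counts.size(); ++t) { start[t] = running; running += counts[t]; }

    // Scatter each element to the next free slot of its group.
    std::vector<T> result(data.size());
    for (std::size_t i = 0; i < data.size(); ++i) result[start[type[i]]++] = data[i];
    return result;
}
```

For the sample input a e t b c a c e t a, the counts are 3, 2, 2, 1, 2 (for a, e, t, b, c), the scanned start offsets are 0, 3, 5, 7, 8, and the grouped result is a a a e e t t b c c.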
Here's a fully worked example, using your sample data set of characters. To help clarify the idea that we intend to group objects that have no inherent ordering, I have created a somewhat arbitrary object definition (my_obj), that has the == comparison operator defined, but no definition for < or >.
$ cat t707.cu
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/for_each.h>
#include <thrust/transform.h>
#include <thrust/transform_scan.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/copy.h>
#include <thrust/count.h>
#include <iostream>
template <typename T>
class my_obj
{
T element;
int index;
public:
__host__ __device__ my_obj() : element(0), index(0) {};
__host__ __device__ my_obj(T a) : element(a), index(0) {};
__host__ __device__ my_obj(T a, int idx) : element(a), index(idx) {};
__host__ __device__
T get() {
return element;}
__host__ __device__
void set(T a) {
element = a;}
__host__ __device__
int get_idx() {
return index;}
__host__ __device__
void set_idx(int idx) {
index = idx;}
__host__ __device__
bool operator ==(my_obj &e2)
{
return (e2.get() == this->get());
}
};
template <typename T>
struct search_functor
{
my_obj<T> *data;
int end;
int *rn;
int *ln;
search_functor(my_obj<T> *_a, int *_rn, int *_ln, int len) : data(_a), rn(_rn), ln(_ln), end(len) {};
__host__ __device__
void operator()(int idx){
for (int i = idx+1; i < end; i++)
if (data[idx] == data[i]) {
ln[i] = idx;
rn[idx] = i;
return;}
return;
}
};
template <typename T>
struct copy_functor
{
my_obj<T> *data;
my_obj<T> *result;
int *rn;
copy_functor(my_obj<T> *_in, my_obj<T> *_out, int *_rn) : data(_in), result(_out), rn(_rn) {};
__host__ __device__
void operator()(const thrust::tuple<int, int> &t1) const {
int idx1 = thrust::get<0>(t1);
int idx2 = thrust::get<1>(t1);
result[idx1] = data[idx2];
int i = rn[idx2];
int j = 1;
while (i != -1){
result[idx1+(j++)] = data[i];
i = rn[i];}
return;
}
};
struct count_functor
{
int *rn;
int *ot;
count_functor(int *_rn, int *_ot) : rn(_rn), ot(_ot) {};
__host__ __device__
int operator()(int idx1, int idx2){
ot[idx1] = idx2;
int i = rn[idx1];
int count = 1;
while (i != -1) {
ot[i] = idx2;
count++;
i = rn[i];}
return count;
}
};
using namespace thrust::placeholders;
int main(){
// data setup
char data[] = { 'a' , 'e' , 't' , 'b' , 'c' , 'a' , 'c' , 'e' , 't' , 'a' };
int sz = sizeof(data)/sizeof(char);
for (int i = 0; i < sz; i++) std::cout << data[i] << ",";
std::cout << std::endl;
thrust::host_vector<my_obj<char> > h_data(sz);
for (int i = 0; i < sz; i++) { h_data[i].set(data[i]); h_data[i].set_idx(i); }
thrust::device_vector<my_obj<char> > d_data = h_data;
// create left and right neighbor indices
thrust::device_vector<int> ln(d_data.size(), -1);
thrust::device_vector<int> rn(d_data.size(), -1);
thrust::for_each(thrust::counting_iterator<int>(0), thrust::counting_iterator<int>(0) + sz, search_functor<char>(thrust::raw_pointer_cast(d_data.data()), thrust::raw_pointer_cast(rn.data()), thrust::raw_pointer_cast(ln.data()), d_data.size()));
// determine number of unique objects
int uni_objs = thrust::count(ln.begin(), ln.end(), -1);
// determine the number of instances of each unique object
// get object starting indices
thrust::device_vector<int> uni_obj_idxs(uni_objs);
thrust::copy_if(thrust::counting_iterator<int>(0), thrust::counting_iterator<int>(0)+d_data.size(), ln.begin(), uni_obj_idxs.begin(), (_1 == -1));
// count each object list
thrust::device_vector<int> num_objs(uni_objs);
thrust::device_vector<int> obj_type(d_data.size());
thrust::transform(uni_obj_idxs.begin(), uni_obj_idxs.end(), thrust::counting_iterator<int>(0), num_objs.begin(), count_functor(thrust::raw_pointer_cast(rn.data()), thrust::raw_pointer_cast(obj_type.data())));
// at this point, we have built object lists that have allowed us to identify a unique, orderable "type" for each object
// the sensible thing to do would be to employ a sort_by_key on obj_type and an index sequence at this point
// and use the reordered index sequence to reorder the original objects, thus grouping them
// however... without sorting...
// build output vector indices
thrust::device_vector<int> copy_start(num_objs.size());
thrust::exclusive_scan(num_objs.begin(), num_objs.end(), copy_start.begin());
// copy (by object type) input to output
thrust::device_vector<my_obj<char> > d_result(d_data.size());
thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(copy_start.begin(), uni_obj_idxs.begin())), thrust::make_zip_iterator(thrust::make_tuple(copy_start.end(), uni_obj_idxs.end())), copy_functor<char>(thrust::raw_pointer_cast(d_data.data()), thrust::raw_pointer_cast(d_result.data()), thrust::raw_pointer_cast(rn.data())));
// display results
std::cout << "Grouped: " << std::endl;
for (int i = 0; i < d_data.size(); i++){
my_obj<char> temp = d_result[i];
std::cout << temp.get() << ",";}
std::cout << std::endl;
for (int i = 0; i < d_data.size(); i++){
my_obj<char> temp = d_result[i];
std::cout << temp.get_idx() << ",";}
std::cout << std::endl;
return 0;
}
$ nvcc -o t707 t707.cu
$ ./t707
a,e,t,b,c,a,c,e,t,a,
Grouped:
a,a,a,e,e,t,t,b,c,c,
0,5,9,1,7,2,8,3,4,6,
$
In the discussion we found out that your real goal is to eliminate duplicates in a vector of float4 elements.
In order to apply thrust::unique the elements need to be sorted.
So you need a sort method for 4 dimensional data. This can be done using space-filling curves. I have previously used the z-order curve (aka morton code) to sort 3D data. There are efficient CUDA implementations for the 3D case available, however quick googling did not return a ready-to-use implementation for the 4D case.
I found a paper which lists a generic algorithm for sorting n-dimensional data points using the z-order curve:
Fast construction of k-Nearest Neighbor Graphs for Point Clouds
(see Algorithm 1 : Floating Point Morton Order Algorithm).
There is also a C++ implementation available for this algorithm.
For 4D data, the loop could be unrolled, but there might be simpler and more efficient algorithms available.
So the (not fully implemented) sequence of operations would then look like this:
#include <thrust/device_vector.h>
#include <thrust/unique.h>
#include <thrust/sort.h>
inline __host__ __device__ float dot(const float4& a, const float4& b)
{
return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}
struct identity_4d
{
__host__ __device__
bool operator()(const float4& a, const float4& b) const
{
// based on the norm function you provided in the discussion
return dot(a,b) < (0.1f*0.1f);
}
};
struct z_order_4d
{
__host__ __device__
bool operator()(const float4& p, const float4& q) const
{
// you need to implement the z-order algorithm here
// ...
}
};
int main()
{
const int N = 100;
thrust::device_vector<float4> data(N);
// fill the data
// ...
thrust::sort(data.begin(),data.end(), z_order_4d());
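// thrust::unique only moves the kept elements to the front and returns the new end;
// erase the leftover tail so that data holds just the unique elements
thrust::device_vector<float4>::iterator new_end = thrust::unique(data.begin(),data.end(), identity_4d());
data.erase(new_end, data.end());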
}

CUDA thrust reduce is so slow? [closed]

I am learning CUDA. Today I tried some code from the book CUDA Application Design and Development, and the result surprised me. Why is CUDA Thrust so slow? Here is the code and the output.
#include <iostream>
using namespace std;
#include<thrust/reduce.h>
#include<thrust/sequence.h>
#include<thrust/host_vector.h>
#include<thrust/device_vector.h>
#include <device_launch_parameters.h>
#include "GpuTimer.h"
__global__ void fillKernel(int *a, int n)
{
int tid = blockDim.x * blockIdx.x + threadIdx.x;
if(tid <n) a[tid] = tid;
}
void fill(int *d_a, int n)
{
int nThreadsPerBlock = 512;
int nBlock = n/nThreadsPerBlock + ((n/nThreadsPerBlock)?1:0);
fillKernel<<<nBlock, nThreadsPerBlock>>>(d_a, n);
}
int main()
{
const int N = 500000;
GpuTimer timer1, timer2;
thrust::device_vector<int> a(N);
fill(thrust::raw_pointer_cast(&a[0]), N);
timer1.Start();
int sumA = thrust::reduce(a.begin(), a.end(), 0);
timer1.Stop();
cout << "Thrust reduce costs " << timer1.Elapsed() << "ms." << endl;
int sumCheck = 0;
timer2.Start();
for(int i = 0; i < N; i++)
sumCheck += i;
timer2.Stop();
cout << "Traditional reduce costs " << timer2.Elapsed() << "ms." << endl;
if (sumA == sumCheck)
cout << "Correct!" << endl;
return 0;
}
You don't have a valid comparison. Your GPU code is doing this:
int sumA = thrust::reduce(a.begin(), a.end(), 0);
Your CPU code is doing this:
for(int i = 0; i < N; i++)
sumCheck += i;
There are so many problems with this methodology that I'm not sure where to start. First of all, the GPU operation is a valid reduction which will give a valid result for any sequence of numbers in the vector a. It so happens that you have the sequence from 0 to N-1 in a, but it doesn't have to be that way and it would still give a correct result. The CPU code only gives the correct answer for the specific sequence 0 to N-1. Secondly, a smart compiler may be able to optimize the heck out of your CPU code, essentially reducing that entire loop to a constant assignment statement. (Summation from 0 to N-1 is just N(N-1)/2, isn't it?) I have no idea what optimizations may be going on under the hood on the CPU side.
A more valid comparison would be to do an actual arbitrary reduction in both cases. An example might be to benchmark thrust::reduce operating on a device vector vs. operating on a host vector. Or write your own serial CPU reduction code that actually operates on a vector, rather than summing the integers from 1 to N.
And as indicated in the comments, if you're serious about wanting help, document things like the HW and SW platform you are running on, and provide all the code. I have no idea what GpuTimer does. I'm voting to close this as "too localized" because I don't think anyone would find this a useful comparison using a methodology like this.
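As an illustration of the "serial CPU reduction that actually operates on a vector" suggested above, here is a minimal sketch in plain C++. The names serial_reduce and time_serial_reduce_ms are hypothetical, and std::chrono is used for timing only because the GpuTimer class from the question is not shown. One more thing worth noticing: the sum of 0 to 499999 is 124999750000, which does not fit in a 32-bit int, so the sketch accumulates into a long long.

```cpp
#include <chrono>
#include <numeric>
#include <vector>

// Serial reduction over data that lives in memory, so it is correct for any contents.
// The accumulator is long long because the sum in the question overflows a 32-bit int.
long long serial_reduce(const std::vector<int>& v)
{
    return std::accumulate(v.begin(), v.end(), 0LL);
}

// Hypothetical timing helper: runs one reduction, stores the sum in `sum_out`,
// and returns the elapsed wall-clock time in milliseconds.
double time_serial_reduce_ms(const std::vector<int>& v, long long& sum_out)
{
    const std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
    sum_out = serial_reduce(v);
    const std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```

A fairer comparison would fill a host vector with the same values as the device vector, time this function against thrust::reduce on the device vector, and store the sum somewhere observable so the compiler cannot discard the work. A single timed call also includes one-time startup costs on the GPU side, so repeating the measurement is advisable.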