CUDA Thrust: Finding the index of the first element in a vector satisfying a predicate (e.g., zero or negative) [Matlab's syntax min(find(x<=0))]

I am attempting to find the index of the first zero or negative value of an array using CUDA Thrust. The serial CPU code I am attempting to port to CUDA Thrust is the following:
for (int i = StartIndex; i <= ArrayLimitIndex; i++)
{
    if (Array[i] <= 0) { DesiredIndex = i; break; }
}
I am thinking that the easiest way to do this on the GPU will be using the find_if function within the Thrust library.
The array is already on the GPU and I am attempting to search for the index on this array using Thrust as such:
struct less_than_or_eq_zero
{
    __host__ __device__
    bool operator() (double x)
    {
        return x <= 0;
    }
};
thrust::device_vector<double>::iterator iter;
thrust::device_ptr<double> dev_ptr_Col46 = thrust::device_pointer_cast(dev_Col46);
iter = thrust::find_if(thrust::device, dev_ptr_Col46, dev_ptr_Col46 + size, less_than_or_eq_zero());
Now I would like to use the value of iter as an argument for my next kernel:
newKernel<<<size, 1>>>(*dev_array, iter)
where the newKernel definition is of the form:
__global__ void newKernel(double *dev_array, iter)
{
    int x = blockIdx.x;
    if(x <= iter)
    {
        //process data here...
    }
}
I know that the code I have here is incorrect and I have a few questions regarding the use of iter. First, iter is a device_vector iterator. Is there any way I can make iter just one value and not an iterator? Also, once find_if has executed, how can I use the value of iter in my next kernel call?
Any help with this would be greatly appreciated.
Thanks

I'm summarizing the comments by talonmies and Jared Hoberock above, as well as the answer by Sebastian Dressler, in a fully compilable and executable example. The code calculates, by CUDA Thrust, the index of the first element of a vector satisfying a predicate (x <= 0. in this case). I hope it will be helpful for future readers.
#include <thrust/device_vector.h>
#include <stdio.h>
struct less_than_or_eq_zero
{
    __host__ __device__ bool operator() (double x) { return x <= 0.; }
};
int main(void)
{
    int N = 6;
    thrust::device_vector<float> D(N);
    D[0] = 3.;
    D[1] = 2.3;
    D[2] = -1.3;
    D[3] = 0.;
    D[4] = 3.;
    D[5] = -44.;
    thrust::device_vector<float>::iterator iter1 = D.begin();
    thrust::device_vector<float>::iterator iter2 = thrust::find_if(D.begin(), D.begin() + N, less_than_or_eq_zero());
    int d = thrust::distance(iter1, iter2);
    printf("Index = %i\n", d);
    getchar();
    return 0;
}

As you do not use a device_vector in your kernel but a raw array, you have to pass it an index and not an iterator. You can obtain the index by using thrust::distance to calculate the distance between dev_ptr_Col46 and iter.
You'll also want to read the Thrust iterator documentation, where distance is documented.
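A minimal sketch of that approach, reusing the names from the original question (dev_Col46, size, less_than_or_eq_zero and newKernel are assumed to be declared as shown there):
thrust::device_ptr<double> dev_ptr_Col46 = thrust::device_pointer_cast(dev_Col46);
thrust::device_ptr<double> iter = thrust::find_if(dev_ptr_Col46, dev_ptr_Col46 + size, less_than_or_eq_zero());
int index = thrust::distance(dev_ptr_Col46, iter); // plain int; equals size if no element matched
newKernel<<<size, 1>>>(dev_Col46, index);          // the kernel now takes (double *, int)
With this, the kernel signature becomes __global__ void newKernel(double *dev_array, int index) and the test inside it is simply if (x <= index).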

Try this:
thrust::device_ptr<double> val_ptr = thrust::find_if(dev_ptr_Col46, dev_ptr_Col46 + size, less_than_or_eq_zero());
double *val = thrust::raw_pointer_cast(val_ptr);
newKernel<<<size, 1>>>(dev_array, val);
Your kernel will have to have signature
__global__ void newKernel(double * dev_array, double * val)
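As a hedged sketch that is not part of the original answer, and assuming dev_array is the same array that was searched, the kernel body could compare addresses to recover the index of the found element:
__global__ void newKernel(double *dev_array, double *val)
{
    int x = blockIdx.x;
    // val points into dev_array, so (val - dev_array) is the index found by find_if
    if (x <= (int)(val - dev_array))
    {
        //process data here...
    }
}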

Related

How to guarantee the random number generator seed to be different each time in thrust

When we write a CUDA kernel, we always do something like this to guarantee that the generator state gets updated:
__global__ void kernel(curandState *globalState){
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    curandState localState = globalState[tid]; // copy the state to a local working copy
    // generate random numbers with localState, e.g. curand_uniform(&localState)
    globalState[tid] = localState;             // write the advanced state back
}
and if we run the kernel several times, the random numbers will always be different.
My question is: if we want to use Thrust to generate random numbers based on this question:
Generating a random number vector between 0 and 1.0 using Thrust
and talonmies' answer, and we need to run the transform several times with the same functor prg, how can we get a different seed for each operation?
I tried to rewrite the code as following:
#include <thrust/random.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <iostream>
#include <time.h>
struct prg
{
    float a, b;
    unsigned int N;
    __host__ __device__
    prg(float _a=0.f, float _b=1.f, unsigned int _N = time(NULL)) : a(_a), b(_b), N(_N) {};
    __host__ __device__
    float operator()(const unsigned int n) const
    {
        thrust::default_random_engine rng(N);
        thrust::uniform_real_distribution<float> dist(a, b);
        rng.discard(n);
        return dist(rng);
    }
};
int main(void)
{
    const int N = 5;
    thrust::device_vector<float> numbers(N);
    thrust::counting_iterator<unsigned int> index_sequence_begin(0);
    // first operation
    thrust::transform(index_sequence_begin, index_sequence_begin + N, numbers.begin(), prg(1.f, 2.f));
    for(int i = 0; i < N; i++)
    {
        std::cout << numbers[i] << std::endl;
    }
    // second operation
    thrust::transform(index_sequence_begin, index_sequence_begin + N, numbers.begin(), prg(1.f, 2.f));
    for(int i = 0; i < N; i++)
    {
        std::cout << numbers[i] << std::endl;
    }
    return 0;
}
The first and second operations generate the same numbers. I know it is because the time difference between the two calls is too short for time(NULL) to change, so both functors get the same seed. How should I modify the code to get different random numbers for these two operations? I guess it is possible to assign the seed based on an operation counter (1, 2, ..., 10000, 10001, ... N), but would that be expensive to do?
To paraphrase John von Neumann "Nothing as important as random numbers should be left to chance".
If you cannot guarantee that the seeds for the random generators are different (and it appears you cannot in this case), then don't try and have different seeds. Use one seeded generator instance and take different sequences from it.
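A minimal sketch of that idea, adapting the prg functor from the question: keep one fixed seed and give each transform call a different offset into the engine's sequence (the offset member is my addition, not part of the original code):
#include <thrust/random.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <iostream>
struct prg
{
    float a, b;
    unsigned int offset; // starting position of this call within the engine's sequence
    __host__ __device__
    prg(float _a, float _b, unsigned int _offset) : a(_a), b(_b), offset(_offset) {};
    __host__ __device__
    float operator()(const unsigned int n) const
    {
        thrust::default_random_engine rng(1234);             // one fixed seed for every call
        thrust::uniform_real_distribution<float> dist(a, b);
        rng.discard(offset + n);                             // each call reads a disjoint part of the sequence
        return dist(rng);
    }
};
int main(void)
{
    const int N = 5;
    thrust::device_vector<float> numbers(N);
    thrust::counting_iterator<unsigned int> index_sequence_begin(0);
    // first operation consumes sequence elements [0, N)
    thrust::transform(index_sequence_begin, index_sequence_begin + N, numbers.begin(), prg(1.f, 2.f, 0));
    for(int i = 0; i < N; i++) std::cout << numbers[i] << std::endl;
    // second operation consumes sequence elements [N, 2N), so its values differ from the first
    thrust::transform(index_sequence_begin, index_sequence_begin + N, numbers.begin(), prg(1.f, 2.f, N));
    for(int i = 0; i < N; i++) std::cout << numbers[i] << std::endl;
    return 0;
}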

Using both CUB and Thrust for parallel sum scan

I am trying to do a parallel sum scan on a test vector. I am using both the Thrust and CUB libraries for this purpose.
struct CustomSum
{
    template <typename T>
    CUB_RUNTIME_FUNCTION __forceinline__
    T operator()(const T &a, const T &b) const {
        return a + b;
    }
};
// 2d array stored in row-major order [(0,0), (0,1), (0,2), ... ]
thrust::host_vector<int> hVec_I1(SIZE_IMG, 1);
thrust::host_vector<int> hVec_I2(SIZE_IMG, 1);
thrust::host_vector<int> h_out(SIZE_IMG, 1);
CustomSum sum_op;
// Initialize vector with synthetic image:
initialize(N, N, hVec_I1, hVec_I2);
// Compute Integral Image M1 and M2
thrust::device_vector<int> dVec_M1 = hVec_I1;
thrust::device_vector<int> dVec_M2 = hVec_I2;
thrust::device_vector<int> d_o = h_out;
//thrust::device_ptr<double> d_in = dVec_M1.data();
//thrust::device_ptr<double> d_out1 = d_out.data();
int* d_in = thrust::raw_pointer_cast(&dVec_M1[0]);
int *d_out = thrust::raw_pointer_cast(&d_o[0]);
//d_in = thrust::raw_pointer_cast(dVec_M2.data());
//thrust::device_vector<int> d_out;
//int *d_out = thrust::raw_pointer_cast(dVec_M1.data());
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
// Run inclusive prefix sum-scan
cub::DeviceScan::InclusiveScan(d_temp_storage, temp_storage_bytes, d_in, d_out, sum_op, SIZE_IMG);
// Allocate temporary storage for inclusive prefix scan
cudaMalloc(&d_temp_storage, temp_storage_bytes);
// Run inclusive prefix sum-scan
cub::DeviceScan::InclusiveScan(d_temp_storage, temp_storage_bytes, d_in, d_out, sum_op, SIZE_IMG);
The error I am getting is
Error 43 error : calling a __host__ function("CustomSum::operator ()<int> ") from a __device__ function("cub::TilePrefixCallbackOp<int, CustomSum, cub::ScanTileState<int, (bool)1> > ::operator ()") is not allowed c:\users\asu_cuda_laptop\documents\visual studio 2013\projects\stats_kernel\cub\agent\single_pass_scan_operators.cuh 747 1 stats_kernel
I could not interpret the error correctly and I am sure there is a problem with the way I am handling raw pointers. Any help is appreciated.
Related link: How to use CUB and Thrust in one CUDA code
Try defining CustomSum::operator() as a __device__ function. More on __host__ vs __device__ functions in the CUDA C programming guide.
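A minimal sketch of that fix, keeping everything else from the question unchanged; marking the operator __host__ __device__ (or just __device__) lets CUB call it from device code:
struct CustomSum
{
    template <typename T>
    __host__ __device__ __forceinline__
    T operator()(const T &a, const T &b) const {
        // callable from device code, so cub::DeviceScan::InclusiveScan can invoke it inside its kernels
        return a + b;
    }
};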

How to implement properly an inline function in the device that returns a vector to another device function?

I want to properly implement an inlined device function that fills a vector of dynamic size and returns the filled vector, like:
__device__ inline thrust::device_vector<double> make_array(double zeta, int l)
{
    thrust::device_vector<double> ret;
    int N = (int)(5*l + zeta); //the size of the array will depend on l and zeta, in a complex way...
    // Make sure of sufficient memory allocation
    ret.reserve(N);
    // Resize array
    ret.resize(N);
    //fill it:
    //for(int i=0;i<N;i++)
    //    ...;
    return ret;
}
My goal is to use the content of the returned vector in another device function like:
__device__ inline double use_array(double zeta, int l)
{
    thrust::device_vector<double> array = make_array(zeta, l);
    double result = 0;
    for(int i = 0; i < array.size(); i++)
        result += array[i];
    return result;
}
How can I do it properly? My feeling is that a Thrust vector is designed for this type of task, but I want to do it properly. What is the standard CUDA approach to this task?
thrust::device_vector is not usable in device code.
However you can return a pointer to a dynamically allocated area, like so:
#include <assert.h>
template <typename T>
__device__ T* make_array(T zeta, int l)
{
    int N = (int)(5*l + zeta); //the size of the array will depend on l and zeta, in a complex way...
    T *ret = (T *)malloc(N*sizeof(T));
    assert(ret != NULL); // error checking
    //fill it:
    //for(int i=0;i<N;i++)
    //    ret[i] = ...;
    return ret;
}
The inline keyword should not be necessary. The compiler will aggressively inline functions wherever possible.
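As a hedged sketch (my addition, not part of the original answer), the use_array function from the question could then consume that pointer and release it with in-kernel free; note that the size N has to be recomputed or passed along, because a raw pointer does not carry it:
__device__ double use_array(double zeta, int l)
{
    int N = (int)(5*l + zeta);                   // same size formula as in make_array
    double *array = make_array<double>(zeta, l);
    double result = 0;
    for(int i = 0; i < N; i++)
        result += array[i];
    free(array);                                 // memory from in-kernel malloc must be freed on the device
    return result;
}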

thrust::device_vector in constant memory

I have a float array that needs to be referenced many times on the device, so I believe the best place to store it is in __constant__ memory (using this reference). The array (or vector) will need to be written once at run-time when initializing, but read by multiple different functions many millions of times, so copying it to the kernel on each function call seems like A Bad Idea.
const int n = 32;
__constant__ float dev_x[n]; //the array in question
struct struct_max : public thrust::unary_function<float,float> {
    float C;
    struct_max(float _C) : C(_C) {}
    __host__ __device__ float operator()(const float& x) const { return fmax(x,C); }
};
void foo(const thrust::host_vector<float> &, const float &);
int main() {
    thrust::host_vector<float> x(n);
    //magic happens populate x
    cudaMemcpyToSymbol(dev_x, x.data(), n*sizeof(float));
    foo(x, 0.0);
    return(0);
}
void foo(const thrust::host_vector<float> &input_host_x, const float &x0) {
    thrust::device_vector<float> dev_sol(n);
    thrust::host_vector<float> host_sol(n);
    //this method works fine, but the memory transfer is unacceptable
    thrust::device_vector<float> input_dev_vec(n);
    input_dev_vec = input_host_x; //I want to avoid this
    thrust::transform(input_dev_vec.begin(), input_dev_vec.end(), dev_sol.begin(), struct_max(x0));
    host_sol = dev_sol; //this memory transfer for debugging
    //this method compiles fine, but crashes at runtime
    thrust::device_ptr<float> dev_ptr = thrust::device_pointer_cast(dev_x);
    thrust::transform(dev_ptr, dev_ptr+n, dev_sol.begin(), struct_max(x0));
    host_sol = dev_sol; //this line crashes
}
I tried adding a global thrust::device_vector<float> dev_x(n), but that also crashed at run-time, and it would be in global memory rather than __constant__ memory anyway.
This can all be made to work if I just discard the thrust library, but is there a way to use the thrust library with globals and device constant memory?
Good question! You can't cast a __constant__ array as if it's a regular device pointer.
I will answer your question (after the line below), but first: this is a bad use of __constant__, and it isn't really what you want. The constant cache in CUDA is optimized for uniform access across threads in a warp. That means all threads in the warp access the same location at the same time. If each thread of the warp accesses a different constant memory location, then the accesses get serialized. So your access pattern, where consecutive threads access consecutive memory locations, will be 32 times slower than a uniform access. You should really just use device memory. If you need to write the data once, but read it many times, then just use a device_vector: initialize it once, and then read it many times.
To do what you asked, you can use a thrust::counting_iterator as the input to thrust::transform to generate a range of indices into your __constant__ array. Then your functor's operator() takes an int index operand rather than a float value operand, and does the lookup into constant memory.
(Note that this means your functor is now __device__ code only. You could easily overload the operator to take a float and call it differently on host data if you need portability.)
I modified your example to initialize the data and print the result to verify that it is correct.
#include <stdio.h>
#include <stdlib.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/counting_iterator.h>
const int n = 32;
__constant__ float dev_x[n]; //the array in question
struct struct_max : public thrust::unary_function<float,float> {
    float C;
    struct_max(float _C) : C(_C) {}
    // only works as a device function
    __device__ float operator()(const int& i) const {
        // use index into constant array
        return fmax(dev_x[i], C);
    }
};
void foo(const thrust::host_vector<float> &input_host_x, const float &x0) {
    thrust::device_vector<float> dev_sol(n);
    thrust::host_vector<float> host_sol(n);
    thrust::transform(thrust::make_counting_iterator(0),
                      thrust::make_counting_iterator(n),
                      dev_sol.begin(),
                      struct_max(x0));
    host_sol = dev_sol; // copy the result back to the host
    for (int i = 0; i < n; i++)
        printf("%f\n", host_sol[i]);
}
int main() {
    thrust::host_vector<float> x(n);
    //magic happens populate x
    for (int i = 0; i < n; i++) x[i] = rand() / (float)RAND_MAX;
    cudaMemcpyToSymbol(dev_x, x.data(), n*sizeof(float));
    foo(x, 0.5);
    return(0);
}

Thrust reduce not working with non equal input/output types

I'm attempting to reduce the min and max of an array of values using Thrust and I seem to be stuck. Given an array of floats, what I would like is to compute their min and max values in one pass, but using Thrust's reduce method I instead get the mother (or at least auntie) of all template compile errors.
My original code contains 5 lists of values spread over 2 float4 arrays that I want reduced, but I've boiled it down to this short example.
struct ReduceMinMax {
    __host__ __device__
    float2 operator()(float lhs, float rhs) {
        return make_float2(Min(lhs, rhs), Max(lhs, rhs));
    }
};
int main(int argc, char *argv[]){
    thrust::device_vector<float> hat(4);
    hat[0] = 3;
    hat[1] = 5;
    hat[2] = 6;
    hat[3] = 1;
    ReduceMinMax binary_op_of_dooooom;
    thrust::reduce(hat.begin(), hat.end(), 4.0f, binary_op_of_dooooom);
}
If I split it into two reductions it of course works. My question is then: is it possible to reduce both the min and max in one pass with Thrust, and how? If not, what is the most efficient way of achieving said reduction? Will a transform iterator help me (and if so, will the reduction then be a one-pass reduction)?
Some additional info:
I'm using Thrust 1.5 (as supplied by CUDA 4.2.7)
My actual code is using reduce_by_key, not just reduce.
I found transform_reduce while writing this question, but that one doesn't take keys into account.
As talonmies notes, your reduction does not compile because thrust::reduce expects the binary operator's argument types to match its result type, but ReduceMinMax's argument type is float, while its result type is float2.
thrust::minmax_element implements this operation directly, but if necessary you could instead implement your reduction with thrust::inner_product, which generalizes thrust::reduce:
#include <thrust/inner_product.h>
#include <thrust/device_vector.h>
#include <thrust/extrema.h>
#include <cassert>
struct minmax_float
{
    __host__ __device__
    float2 operator()(float lhs, float rhs)
    {
        return make_float2(thrust::min(lhs, rhs), thrust::max(lhs, rhs));
    }
};
struct minmax_float2
{
    __host__ __device__
    float2 operator()(float2 lhs, float2 rhs)
    {
        return make_float2(thrust::min(lhs.x, rhs.x), thrust::max(lhs.y, rhs.y));
    }
};
float2 minmax1(const thrust::device_vector<float> &x)
{
    return thrust::inner_product(x.begin(), x.end(), x.begin(), make_float2(4.0, 4.0f), minmax_float2(), minmax_float());
}
float2 minmax2(const thrust::device_vector<float> &x)
{
    using namespace thrust;
    pair<device_vector<float>::const_iterator, device_vector<float>::const_iterator> ptr_to_result;
    ptr_to_result = minmax_element(x.begin(), x.end());
    return make_float2(*ptr_to_result.first, *ptr_to_result.second);
}
int main()
{
    thrust::device_vector<float> hat(4);
    hat[0] = 3;
    hat[1] = 5;
    hat[2] = 6;
    hat[3] = 1;
    float2 result1 = minmax1(hat);
    float2 result2 = minmax2(hat);
    assert(result1.x == result2.x);
    assert(result1.y == result2.y);
}