thrust reduction result on device memory - cuda

Is it possible to leave the return value of a thrust::reduce operation in device-allocated memory? In case it is, is it just as easy as assigning the value to a cudaMalloc'ed area, or should I use a thrust::device_ptr?

Is it possible to leave the return value of a thrust::reduce operation in device-allocated memory?
The short answer is no.
thrust::reduce returns a value, the result of the reduction, and that value must be stored in a host-resident variable:
Take for example reduce, which is synchronous and always returns its result to the CPU:
template<typename Iterator, typename T>
T reduce(Iterator first, Iterator last, T init);
Once the result of the operation has been returned to the CPU, you can copy it to the GPU if you like:
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
int main(){
thrust::device_vector<int> data(256, 1);
thrust::device_vector<int> result(1);
result[0] = thrust::reduce(data.begin(), data.end());
std::cout << "result = " << result[0] << std::endl;
return 0;
}
Another possible alternative is to use thrust::reduce_by_key, which writes the reduction result to device memory rather than returning it to the host. If you use a single key for your entire array, the net result will be a single output, similar to thrust::reduce.

Yes, it should be possible by using thrust::reduce_by_key instead with a thrust::constant_iterator supplied for the keys.


difference between "rint" and "nearbyint"?

What is the difference between rint and nearbyint?
Can they produce different output in some cases?
If not, is there a conceptual difference in how they compute the result?
Since these are both C functions, we can check the man page for both of these. An excerpt:
The nearbyint() functions round their argument to an integer value in floating-point format, using the current rounding direction (see fesetround(3)) and without raising the inexact exception.
The rint() functions do the same, but will raise the inexact exception (FE_INEXACT, checkable via fetestexcept(3)) when the result differs in value from the argument.
In other words, rint lets you detect afterwards whether rounding changed the value, while nearbyint does not. An example of checking the flag:
#include <iostream>
#include <cmath>
#include <cfenv>
int main()
{
    std::feclearexcept(FE_INEXACT);
    double a = std::rint(93819.249);
    // FE_INEXACT is set when the result differs in value from the argument
    if (std::fetestexcept(FE_INEXACT))
        std::cerr << "Inexact: the argument was not already an integer\n";
    std::cout << a << '\n';
}
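To see the two functions side by side, here is a minimal sketch that calls each one on the same input and reports whether FE_INEXACT was raised. The names raises_inexact, call_rint and call_nearbyint are made up for this example, and the volatile variables are an assumption about how to keep an optimizing compiler from evaluating the call at compile time.

```cpp
#include <cfenv>
#include <cmath>

// Hypothetical helper (not part of any library): calls f(x) and reports
// whether the FE_INEXACT floating-point exception flag was raised.
inline bool raises_inexact(double (*f)(double), double x)
{
    // volatile keeps the compiler from folding f(x) at compile time
    // or moving the call across the flag operations.
    volatile double in = x;
    std::feclearexcept(FE_INEXACT);
    volatile double out = f(in);
    (void)out;
    return std::fetestexcept(FE_INEXACT) != 0;
}

// Thin wrappers, because std::rint and std::nearbyint are overloaded
// and cannot be passed directly as a double(*)(double).
inline double call_rint(double x) { return std::rint(x); }
inline double call_nearbyint(double x) { return std::nearbyint(x); }
```

Under these assumptions, raises_inexact(call_rint, 93819.249) should report the flag as set, while the same call with call_nearbyint should not. Both functions return the same rounded value.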

Unclear Output From Function of Two Variables in C++

I have written a function which takes two doubles and outputs some polynomial expression. This is a prototype for something (much) more complicated that I need to do later on. The code should be fairly straightforward, but I must be doing something wrong, because the output makes no sense. The function returns 0 or -0, no matter what values I pass to the arguments.
#include <iostream>
#include <cmath>
#include <iomanip>
using namespace std;
double funcTwoVars(double x, double y){
double result = (1/6)*(1/x - 3*x/4)*y;
return result;
}
int main(){
double fxy = funcTwoVars(10,10);
cout << fixed << setprecision(6) << fxy << endl;
return 0;
}
When I run it, the output is the following:
christian#christian-HP-Pavilion-x360-Convertible:~/code/HYBRIDS$ g++ functionOfTwoVars.cpp -o functionOfTwoVars
christian#christian-HP-Pavilion-x360-Convertible:~/code/HYBRIDS$ ./functionOfTwoVars
-0.000000
I have no idea why it does not output the correct value. Any suggestions?
Thanks.
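I actually was able to find the answer. The problem was that I was using int values in an expression that needs doubles: (1/6) is integer division and evaluates to 0, so the whole product is 0 (or -0 when the other factor is negative). Writing the constant as 1.0/6 fixes it. That was a stupid mistake on my part.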

Thrust: How to return indices of active array elements

How can I use thrust to return the indices of active array elements i.e. return a vector of indices in which array elements are equal to 1?
Expanding on this, how would this work in the case of multi-dimensional indices given the array dimensions?
Edit: currently the function looks like this
template<class VoxelType>
void VoxelVolumeT<VoxelType>::cudaThrustReduce(VoxelType *cuda_voxels)
{
device_ptr<VoxelType> cuda_voxels_ptr(cuda_voxels);
int active_voxel_count = thrust::count(cuda_voxels_ptr, cuda_voxels_ptr + dim.x*dim.y*dim.z, 1);
device_vector<VoxelType> active_voxels;
thrust::copy_if(make_counting_iterator(0),
make_counting_iterator(dim.x*dim.y*dim.z),
cuda_voxels_ptr,
active_voxels.begin(),
_1 == 1);
}
Which is giving the error
Error 15 error : no instance of overloaded function "thrust::copy_if" matches the argument list
Combine counting_iterator with copy_if:
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/functional.h>
...
using namespace thrust;
using namespace thrust::placeholders;
copy_if(make_counting_iterator<int>(0),
make_counting_iterator<int>(array.size()), // indices from 0 to N
array.begin(), // array data
active_indices.begin(), // result will be written here
_1 == 1); // keep the index when the corresponding array element equals 1
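For the multi-dimensional part of the question, each flat index can be converted back to coordinates once the layout is known. The sketch below is host-only plain C++, not Thrust code. The names Index3, unflatten and active_indices are hypothetical, and it assumes x varies fastest (idx = x + dim_x * (y + dim_y * z)), which the question does not state.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical 3-D index type for this sketch.
struct Index3 { int x, y, z; };

// Convert a flat index to (x, y, z), ASSUMING x varies fastest:
// idx = x + dim_x * (y + dim_y * z).
inline Index3 unflatten(int idx, int dim_x, int dim_y)
{
    Index3 r;
    r.x = idx % dim_x;
    r.y = (idx / dim_x) % dim_y;
    r.z = idx / (dim_x * dim_y);
    return r;
}

// Host-side analogue of the counting_iterator + copy_if pattern:
// keep index i whenever voxels[i] == 1.
inline std::vector<int> active_indices(const std::vector<int>& voxels)
{
    std::vector<int> out;
    for (std::size_t i = 0; i < voxels.size(); ++i)
        if (voxels[i] == 1)
            out.push_back(static_cast<int>(i));
    return out;
}
```

If the volume is stored in a different order, only the arithmetic in unflatten needs to change. The index-selection step stays the same.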

fast CUDA thrust custom comparison operator

I'm evaluating CUDA and currently using Thrust library to sort numbers.
I'd like to create my own comparer for thrust::sort, but it slows down dramatically!
I created my own less implementation by just copying code from functional.h.
However it seems to be compiled in some other way and works very slowly.
default comparer: thrust::less() - 94ms
my own comparer: less() - 906ms
I'm using Visual Studio 2010. What should I do to get the same performance as with option 1?
Complete code:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
int myRand()
{
static int counter = 0;
if ( counter++ % 10000 == 0 )
srand(time(NULL)+counter);
return (rand()<<16) | rand();
}
template<typename T>
struct less : public thrust::binary_function<T,T,bool>
{
__host__ __device__ bool operator()(const T &lhs, const T &rhs) const {
return lhs < rhs;
}
};
int main()
{
thrust::host_vector<int> h_vec(10 * 1000 * 1000);
thrust::generate(h_vec.begin(), h_vec.end(), myRand);
thrust::device_vector<int> d_vec = h_vec;
int clc = clock();
thrust::sort(d_vec.begin(), d_vec.end(), less<int>());
printf("%dms\n", (clock()-clc) * 1000 / CLOCKS_PER_SEC);
return 0;
}
You are observing a difference in performance because Thrust implements the sort with different algorithms depending on the arguments provided to thrust::sort.
In case 1, Thrust can prove that the sort can be implemented in linear time with a radix sort. This is because the type of the data to sort is a built-in numeric type (int), and the comparison function is the built-in less-than operation: Thrust recognizes that thrust::less<int> produces the same result as x < y.
In case 2, Thrust knows nothing about your user-provided less<int> and has to fall back on a more conservative comparison-based sort with different asymptotic complexity, even though your less<int> is in fact equivalent to thrust::less<int>.
In general, user-defined comparison operators can't be used with more restrictive, faster sorts which manipulate the binary representation of data such as radix sort. In these cases, Thrust falls back on a more general, but slower sort.

Thrust Complex Transform of 3 different size vectors

Hello, I have this loop in C++, and I was trying to convert it to thrust, but I am not getting the same results...
Any ideas?
thank you
C++ Code
for (i=0;i<n;i++)
for (j=0;j<n;j++)
values[i]=values[i]+(binv[i*n+j]*d[j]);
Thrust Code
thrust::fill(values.begin(), values.end(), 0);
thrust::transform(make_zip_iterator(make_tuple(
thrust::make_permutation_iterator(values.begin(), thrust::make_transform_iterator(thrust::make_counting_iterator(0), IndexDivFunctor(n))),
binv.begin(),
thrust::make_permutation_iterator(d.begin(), thrust::make_transform_iterator(thrust::make_counting_iterator(0), IndexModFunctor(n))))),
make_zip_iterator(make_tuple(
thrust::make_permutation_iterator(values.begin(), thrust::make_transform_iterator(thrust::make_counting_iterator(0), IndexDivFunctor(n))) + n,
binv.end(),
thrust::make_permutation_iterator(d.begin(), thrust::make_transform_iterator(thrust::make_counting_iterator(0), IndexModFunctor(n))) + n)),
thrust::make_permutation_iterator(values.begin(), thrust::make_transform_iterator(thrust::make_counting_iterator(0), IndexDivFunctor(n))),
function1()
);
Thrust Functions
struct IndexDivFunctor: thrust::unary_function<int, int>
{
int n;
IndexDivFunctor(int n_) : n(n_) {}
__host__ __device__
int operator()(int idx)
{
return idx / n;
}
};
struct IndexModFunctor: thrust::unary_function<int, int>
{
int n;
IndexModFunctor(int n_) : n(n_) {}
__host__ __device__
int operator()(int idx)
{
return idx % n;
}
};
struct function1
{
template <typename Tuple>
__host__ __device__
double operator()(Tuple v)
{
return thrust::get<0>(v) + thrust::get<1>(v) * thrust::get<2>(v);
}
};
To begin with, some general comments. Your loop
for (i=0;i<n;i++)
for (j=0;j<n;j++)
v[i]=v[i]+(B[i*n+j]*d[j]);
is the equivalent of the standard BLAS gemv operation y <- alpha*A*x + beta*y, here with alpha = beta = 1 (that is, v <- v + B*d), where the matrix is stored in row major order. The optimal way to do this on the device would be using CUBLAS, not something constructed out of thrust primitives.
Having said that, there is absolutely no way the thrust code you posted is ever going to do what your serial code does. The errors you are seeing are not a result of floating point non-associativity. Fundamentally, thrust::transform applies the supplied functor to every element of the input range and stores the result through the output iterator. To yield the same result as the loop you posted, the thrust::transform call would need to perform (n*n) operations of the fmad functor you posted. Clearly it does not. Further, there is no guarantee that thrust::transform would perform the summation/reduction operation in a fashion that would be safe from memory races.
The correct solution is probably going to be something like:
1. Use thrust::transform to compute the (n*n) products of the elements of B and d
2. Use thrust::reduce_by_key to reduce the products into partial sums, yielding Bd
3. Use thrust::transform to add the resulting matrix-vector product to v to yield the final result.
In code, firstly define a functor like this:
struct functor
{
template <typename Tuple>
__host__ __device__
double operator()(Tuple v)
{
return thrust::get<0>(v) * thrust::get<1>(v);
}
};
Then do the following to compute the matrix-vector multiplication
typedef thrust::device_vector<int> iVec;
typedef thrust::device_vector<double> dVec;
typedef thrust::counting_iterator<int> countIt;
typedef thrust::transform_iterator<IndexDivFunctor, countIt> columnIt;
typedef thrust::transform_iterator<IndexModFunctor, countIt> rowIt;
// Assuming the following allocations on the device
dVec B(n*n), v(n), d(n);
// transformation iterators mapping to vector rows and columns
columnIt cv_begin = thrust::make_transform_iterator(thrust::make_counting_iterator(0), IndexDivFunctor(n));
columnIt cv_end = cv_begin + (n*n);
rowIt rv_begin = thrust::make_transform_iterator(thrust::make_counting_iterator(0), IndexModFunctor(n));
rowIt rv_end = rv_begin + (n*n);
dVec temp(n*n);
thrust::transform(make_zip_iterator(
make_tuple(
B.begin(),
thrust::make_permutation_iterator(d.begin(),rv_begin) ) ),
make_zip_iterator(
make_tuple(
B.end(),
thrust::make_permutation_iterator(d.begin(),rv_end) ) ), // same element iterator as the begin; only the index iterator advances
temp.begin(),
functor());
iVec outkey(n);
dVec Bd(n);
thrust::reduce_by_key(cv_begin, cv_end, temp.begin(), outkey.begin(), Bd.begin());
thrust::transform(v.begin(), v.end(), Bd.begin(), v.begin(), thrust::plus<double>());
Of course, this is a terribly inefficient way to do the computation compared to using a purpose designed matrix-vector multiplication code like dgemv from CUBLAS.
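As a sanity check on the three-step decomposition, here is a host-only sketch in plain C++ that computes the result both ways: once with the serial loop from the question and once as three separate passes. The function names gemv_loop and gemv_three_steps are made up for this example, and nothing here uses Thrust.

```cpp
#include <cstddef>
#include <vector>

// Reference: the serial loop from the question,
// v[i] += sum over j of B[i*n+j] * d[j].
inline std::vector<double> gemv_loop(const std::vector<double>& B,
                                     const std::vector<double>& d,
                                     std::vector<double> v)
{
    const std::size_t n = d.size();
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            v[i] += B[i * n + j] * d[j];
    return v;
}

// The same computation as three separate passes, mirroring the
// transform / reduce_by_key / transform structure.
inline std::vector<double> gemv_three_steps(const std::vector<double>& B,
                                            const std::vector<double>& d,
                                            std::vector<double> v)
{
    const std::size_t n = d.size();
    // Step 1: all n*n products, with d indexed by idx % n.
    std::vector<double> temp(n * n);
    for (std::size_t idx = 0; idx < n * n; ++idx)
        temp[idx] = B[idx] * d[idx % n];
    // Step 2: sum the products that share the key idx / n (the row index).
    std::vector<double> Bd(n, 0.0);
    for (std::size_t idx = 0; idx < n * n; ++idx)
        Bd[idx / n] += temp[idx];
    // Step 3: add the matrix-vector product to v.
    for (std::size_t i = 0; i < n; ++i)
        v[i] += Bd[i];
    return v;
}
```

The two versions add the terms in a different order, so with general floating-point data they may differ in the last digits. With small integer-valued inputs they agree exactly.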
How much do your results differ? Is it a completely different answer, or does it differ only in the last digits? Is the loop executed only once, or is it part of some kind of iterative process?
Floating-point operations, especially repeated additions or multiplications, are not associative because of rounding, so evaluating them in a different order can change the result. Moreover, if you use fast-math optimisations, the operations may not be IEEE compliant.
For starters, check out this Wikipedia section on floating-point numbers: http://en.wikipedia.org/wiki/Floating_point#Accuracy_problems