How to bring equal elements together using thrust without sort - cuda

I have an array of elements such that each element defines the "equal to" operator only.
In other words no ordering is defined for such type of element.
Since I can't use thrust::sort as in the thrust histogram example how can I bring equal elements together using thrust?
For example:
my array is initially
a e t b c a c e t a
where identical characters represent equal elements.
After the elaboration, the array should be
a a a t t b c c e e
but it can be also
a a a c c t t e e b
or any other permutation.

I would recommend that you follow an approach such as that laid out by #m.s. in the posted answer there. As I stated in the comments, ordering of elements is an extremely useful mechanism that aids in the reduction of complexity for problems like this.
However the question as posed asks if it is possible to group like elements without sorting. With an inherently parallel processor like a GPU, I spent some time thinking about how it might be accomplished without sorting.
If we have both a large number of objects, as well as a large number of unique object types, then I think it's possible to bring some level of parallelism to the problem, however my approach outlined here will still have atrocious, scattered memory access patterns. For the case where there are only a small number of distinct or unique object types, the algorithm I am discussing here has little to commend it. This is just one possible approach. There may well be other, far better approaches:
The starting point is to develop a set of "linked lists" that indicate the matching neighbor to the left and the matching neighbor to the right, for each element. This is accomplished via my search_functor and thrust::for_each, on the entire data set. This step is reasonably parallel and also has reasonable memory access efficiency for large data sets, but it does require a worst-case traversal of the entire data set from start to finish (a side-effect, I would call it, of not being able to use ordering; we must compare every element to other elements until we find a match). The generation of two linked lists allows us to avoid all-to-all comparisons.
Once we have the lists (right-neighbor and left-neighbor) built from step 1, it's an easy matter to count the number of unique objects, using thrust::count.
We then get the starting indexes of each unique element (i.e. the leftmost index of each type of unique element, in the dataset), using thrust::copy_if stream compaction.
The next step is to count the number of instances of each of the unique elements. This step is doing list traversal, one thread per element list. If I have a small number of unique elements, this will not effectively utilize the GPU. In addition, the list traversal will result in lousy access patterns.
After we have counted the number of each type of object, we can then build a sequence of starting indices for each object type in the output list, via thrust::exclusive_scan on the numbers of each type of object.
Finally, we can copy each input element to it's appropriate place in the output list. Since we have no way to group or order the elements yet, we must again resort to list traversal. Once again, this will be inefficient use of the GPU if the number of unique object types is small, and will also have lousy memory access patterns.
Here's a fully worked example, using your sample data set of characters. To help clarify the idea that we intend to group objects that have no inherent ordering, I have created a somewhat arbitrary object definition (my_obj), that has the == comparison operator defined, but no definition for < or >.
$ cat t707.cu
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/for_each.h>
#include <thrust/transform.h>
#include <thrust/transform_scan.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/copy.h>
#include <thrust/count.h>
#include <iostream>
template <typename T>
class my_obj
{
T element;
int index;
public:
__host__ __device__ my_obj() : element(0), index(0) {};
__host__ __device__ my_obj(T a) : element(a), index(0) {};
__host__ __device__ my_obj(T a, int idx) : element(a), index(idx) {};
__host__ __device__
T get() {
return element;}
__host__ __device__
void set(T a) {
element = a;}
__host__ __device__
int get_idx() {
return index;}
__host__ __device__
void set_idx(int idx) {
index = idx;}
__host__ __device__
bool operator ==(my_obj &e2)
{
return (e2.get() == this->get());
}
};
template <typename T>
struct search_functor
{
my_obj<T> *data;
int end;
int *rn;
int *ln;
search_functor(my_obj<T> *_a, int *_rn, int *_ln, int len) : data(_a), rn(_rn), ln(_ln), end(len) {};
__host__ __device__
void operator()(int idx){
for (int i = idx+1; i < end; i++)
if (data[idx] == data[i]) {
ln[i] = idx;
rn[idx] = i;
return;}
return;
}
};
template <typename T>
struct copy_functor
{
my_obj<T> *data;
my_obj<T> *result;
int *rn;
copy_functor(my_obj<T> *_in, my_obj<T> *_out, int *_rn) : data(_in), result(_out), rn(_rn) {};
__host__ __device__
void operator()(const thrust::tuple<int, int> &t1) const {
int idx1 = thrust::get<0>(t1);
int idx2 = thrust::get<1>(t1);
result[idx1] = data[idx2];
int i = rn[idx2];
int j = 1;
while (i != -1){
result[idx1+(j++)] = data[i];
i = rn[i];}
return;
}
};
struct count_functor
{
int *rn;
int *ot;
count_functor(int *_rn, int *_ot) : rn(_rn), ot(_ot) {};
__host__ __device__
int operator()(int idx1, int idx2){
ot[idx1] = idx2;
int i = rn[idx1];
int count = 1;
while (i != -1) {
ot[i] = idx2;
count++;
i = rn[i];}
return count;
}
};
using namespace thrust::placeholders;
int main(){
// data setup
char data[] = { 'a' , 'e' , 't' , 'b' , 'c' , 'a' , 'c' , 'e' , 't' , 'a' };
int sz = sizeof(data)/sizeof(char);
for (int i = 0; i < sz; i++) std::cout << data[i] << ",";
std::cout << std::endl;
thrust::host_vector<my_obj<char> > h_data(sz);
for (int i = 0; i < sz; i++) { h_data[i].set(data[i]); h_data[i].set_idx(i); }
thrust::device_vector<my_obj<char> > d_data = h_data;
// create left and right neighbor indices
thrust::device_vector<int> ln(d_data.size(), -1);
thrust::device_vector<int> rn(d_data.size(), -1);
thrust::for_each(thrust::counting_iterator<int>(0), thrust::counting_iterator<int>(0) + sz, search_functor<char>(thrust::raw_pointer_cast(d_data.data()), thrust::raw_pointer_cast(rn.data()), thrust::raw_pointer_cast(ln.data()), d_data.size()));
// determine number of unique objects
int uni_objs = thrust::count(ln.begin(), ln.end(), -1);
// determine the number of instances of each unique object
// get object starting indices
thrust::device_vector<int> uni_obj_idxs(uni_objs);
thrust::copy_if(thrust::counting_iterator<int>(0), thrust::counting_iterator<int>(0)+d_data.size(), ln.begin(), uni_obj_idxs.begin(), (_1 == -1));
// count each object list
thrust::device_vector<int> num_objs(uni_objs);
thrust::device_vector<int> obj_type(d_data.size());
thrust::transform(uni_obj_idxs.begin(), uni_obj_idxs.end(), thrust::counting_iterator<int>(0), num_objs.begin(), count_functor(thrust::raw_pointer_cast(rn.data()), thrust::raw_pointer_cast(obj_type.data())));
// at this point, we have built object lists that have allowed us to identify a unique, orderable "type" for each object
// the sensible thing to do would be to employ a sort_by_key on obj_type and an index sequence at this point
// and use the reordered index sequence to reorder the original objects, thus grouping them
// however... without sorting...
// build output vector indices
thrust::device_vector<int> copy_start(num_objs.size());
thrust::exclusive_scan(num_objs.begin(), num_objs.end(), copy_start.begin());
// copy (by object type) input to output
thrust::device_vector<my_obj<char> > d_result(d_data.size());
thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(copy_start.begin(), uni_obj_idxs.begin())), thrust::make_zip_iterator(thrust::make_tuple(copy_start.end(), uni_obj_idxs.end())), copy_functor<char>(thrust::raw_pointer_cast(d_data.data()), thrust::raw_pointer_cast(d_result.data()), thrust::raw_pointer_cast(rn.data())));
// display results
std::cout << "Grouped: " << std::endl;
for (int i = 0; i < d_data.size(); i++){
my_obj<char> temp = d_result[i];
std::cout << temp.get() << ",";}
std::cout << std::endl;
for (int i = 0; i < d_data.size(); i++){
my_obj<char> temp = d_result[i];
std::cout << temp.get_idx() << ",";}
std::cout << std::endl;
return 0;
}
$ nvcc -o t707 t707.cu
$ ./t707
a,e,t,b,c,a,c,e,t,a,
Grouped:
a,a,a,e,e,t,t,b,c,c,
0,5,9,1,7,2,8,3,4,6,
$

In the discussion we found out that your real goal is to eliminate duplicates in a vector of float4 elements.
In order to apply thrust::unique the elements need to be sorted.
So you need a sort method for 4 dimensional data. This can be done using space-filling curves. I have previously used the z-order curve (aka morton code) to sort 3D data. There are efficient CUDA implementations for the 3D case available, however quick googling did not return a ready-to-use implementation for the 4D case.
I found a paper which lists a generic algorithm for sorting n-dimensional data points using the z-order curve:
Fast construction of k-Nearest Neighbor Graphs for Point Clouds
(see Algorithm 1 : Floating Point Morton Order Algorithm).
There is also a C++ implementation available for this algorithm.
For 4D data, the loop could be unrolled, but there might be simpler and more efficient algorithms available.
So the (not fully implemented) sequence of operations would then look like this:
#include <thrust/device_vector.h>
#include <thrust/unique.h>
#include <thrust/sort.h>
inline __host__ __device__ float dot(const float4& a, const float4& b)
{
return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}
struct identity_4d
{
__host__ __device__
bool operator()(const float4& a, const float4& b) const
{
// based on the norm function you provided in the discussion
return dot(a,b) < (0.1f*0.1f);
}
};
struct z_order_4d
{
__host__ __device__
bool operator()(const float4& p, const float4& q) const
{
// you need to implement the z-order algorithm here
// ...
}
};
int main()
{
const int N = 100;
thrust::device_vector<float4> data(N);
// fill the data
// ...
thrust::sort(data.begin(),data.end(), z_order_4d());
thrust::unique(data.begin(),data.end(), identity_4d());
}

Related

Converting uint32_t to binary in C

The main problem I'm having is to read out values in binary in C. Python and C# had some really quick/easy functions to do this, I found topic about how to do it in C++, I found topic about how to convert int to binary in C, but not how to convert uint32_t to binary in C.
What I am trying to do is to read bit by bit the 32 bits of the DR_REG_RNG_BASE address of an ESP32 (this is the address where the random values of the Random Hardware Generator of the ESP are stored).
So for the moment I was doing that:
#define DR_REG_RNG_BASE 0x3ff75144
void printBitByBit( ){
// READ_PERI_REG is the ESP32 function to read DR_REG_RNG_BASE
uint32_t rndval = READ_PERI_REG(DR_REG_RNG_BASE);
int i;
for (i = 1; i <= 32; i++){
int mask = 1 << i;
int masked_n = rndval & mask;
int thebit = masked_n >> i;
Serial.printf("%i", thebit);
}
Serial.println("\n");
}
At first I thought it was working well. But in fact it takes me out of binary representations that are totally false. Any ideas?
Your shown code has a number of errors/issues.
First, bit positions for a uint32_t (32-bit unsigned integer) are zero-based – so, they run from 0 thru 31, not from 1 thru 32, as your code assumes. Thus, in your code, you are (effectively) ignoring the lowest bit (bit #0); further, when you do the 1 << i on the last loop (when i == 32), your mask will (most likely) have a value of zero (although that shift is, technically, undefined behaviour for a signed integer, as your code uses), so you'll also drop the highest bit.
Second, your code prints (from left-to-right) the lowest bit first, but you want (presumably) to print the highest bit first, as is normal. So, you should run the loop with the i index starting at 31 and decrement it to zero.
Also, your code mixes and mingles unsigned and signed integer types. This sort of thing is best avoided – so it's better to use uint32_t for the intermediate values used in the loop.
Lastly (as mentioned by Eric in the comments), there is a far simpler way to extract "bit n" from an unsigned integer: just use value >> n & 1.
I don't have access to an Arduino platform but, to demonstrate the points made in the above discussion, here is a standard, console-mode C++ program that compares the output of your code to versions with the aforementioned corrections applied:
#include <iostream>
#include <cstdint>
#include <inttypes.h>
int main()
{
uint32_t test = 0x84FF0048uL;
int i;
// Your code ...
for (i = 1; i <= 32; i++) {
int mask = 1 << i;
int masked_n = test & mask;
int thebit = masked_n >> i;
printf("%i", thebit);
}
printf("\n");
// Corrected limits/order/types ...
for (i = 31; i >= 0; --i) {
uint32_t mask = (uint32_t)(1) << i;
uint32_t masked_n = test & mask;
uint32_t thebit = masked_n >> i;
printf("%"PRIu32, thebit);
}
printf("\n");
// Better ...
for (i = 31; i >= 0; --i) {
printf("%"PRIu32, test >> i & 1);
}
printf("\n");
return 0;
}
The three lines of output (first one wrong, as you know; last two correct) are:
001001000000000111111110010000-10
10000100111111110000000001001000
10000100111111110000000001001000
Notes:
(1) On the use of the funny-looking "%"PRu32 format specifier for printing the uint32_t types, see: printf format specifiers for uint32_t and size_t.
(2) The cast on the (uint32_t)(1) constant will ensure that the bit-shift is safe, even when int and unsigned are 16-bit types; without that, you would get undefined behaviour in such a case.
When you printing out a binary string representation of a number, you print the Most Signification Bit (MSB) first, whether the number is a uint32_t or uint16_t, so you will need to have a mask for detecting whether the MSB is a 1 or 0, so you need a mask of 0x80000000, and shift-down on each iteration.
#define DR_REG_RNG_BASE 0x3ff75144
void printBitByBit( ){
// READ_PERI_REG is the ESP32 function to read DR_REG_RNG_BASE
uint32_t rndval = READ_PERI_REG(DR_REG_RNG_BASE);
Serial.println(rndval, HEX); //print out the value in hex for verification purpose
uint32_t mask = 0x80000000;
for (int i=1; i<32; i++) {
Serial.println((rndval & mask) ? "1" : "0");
mask = (uint32_t) mask >> 1;
}
Serial.println("\n");
}
For Arduino, there are actually a couple of built-in functions that can print out the binary string representation of a number. Serial.print(x, BIN) allows you to specify the number base on the 2nd function argument.
Another function that can achieve the same result is itoa(x, str, base) which is not part of standard ANSI C or C++, but available in Arduino to allow you to convert the number x to a str with number base specified.
char str[33];
itoa(rndval, str, 2);
Serial.println(str);
However, both functions does not pad with leading zero, see the result here:
36E68B6D // rndval in HEX
00110110111001101000101101101101 // print by our function
110110111001101000101101101101 // print by Serial.print(rndval, BIN)
110110111001101000101101101101 // print by itoa(rndval, str, 2)
BTW, Arduino is c++, so don't use c tag for your post. I changed it for you.

Need help optimizing thrust cuda code with nested iterator transform_reduce operations

I am working on code I would like to execute efficiently on a GPU. Most of the code has been easy to vectorize and prepare for parallel execution. There are several nice examples on Stack Overflow that have helped me with the standard nested iterators. I have one section I have not been able to successfully condense into an efficient thrust construct. I have taken that section of my code and made a minimum reproducible example. Any advice or hint on how to structure this code would be appreciated.
Thanks
#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>
#include <ctime>
#include <thrust/reduce.h>
#include <thrust/device_vector.h>
typedef thrust::device_vector<double> tDoubleVecDevice;
typedef tDoubleVecDevice::iterator tDoubleVecDeviceIter;
struct functorB{
template <typename T>
__host__ __device__
double operator()(const T &my_tuple){ // do some math
return ( fmod((thrust::get<0>(my_tuple) * thrust::get<1>(my_tuple)),1.0) );
}
};
struct functorC {
template <typename T>
__host__ __device__
double operator()(const T &my_tuple){ // do some math
double distance = fabs( fmod((thrust::get<0>(my_tuple) - thrust::get<1>(my_tuple)),1.0));
return((fmin( distance, 1.0 - distance)) / (5.0));
}
};
int main(void)
{
tDoubleVecDevice resF(36);
tDoubleVecDevice freqI(36);
tDoubleVecDevice trialTs(128);
std::srand(std::time(nullptr));
for(tDoubleVecDeviceIter tIter = trialTs.begin();tIter < trialTs.end(); tIter++ ){
(*tIter) = rand() % 10 + 1.5; // make some random numbers
}
for(tDoubleVecDeviceIter rIter = resF.begin(), fIter = freqI.begin();fIter < resF.end(); rIter++ ,fIter++){
(*fIter) = rand() % 10 + 1.5; // make some random numbers
(*rIter) = rand() % 10 + 1.5; // make some random numbers
}
tDoubleVecDevice trialRs(36);
tDoubleVecDevice errorVect(128);
for( tDoubleVecDeviceIter itTrial = trialTs.begin(), itError = errorVect.begin(); itTrial != trialTs.end(); itTrial++,itError++){
thrust::transform( (thrust::make_zip_iterator(thrust::make_tuple(thrust::make_constant_iterator<double>(*itTrial), freqI.begin()))),
(thrust::make_zip_iterator(thrust::make_tuple(thrust::make_constant_iterator<double>(*itTrial)+36, freqI.end()))),
trialRs.begin() ,functorB());
(*itError) =thrust::transform_reduce(
thrust::make_zip_iterator(thrust::make_tuple(trialRs.begin(),resF.begin())),
thrust::make_zip_iterator(thrust::make_tuple(trialRs.end(),resF.end())),
functorC(),(double) 0,thrust::plus<double>()
);
}
// finds the index of the minimum element;
int minElementIndex = thrust::min_element(errorVect.begin(),errorVect.end()) - errorVect.begin();
double result = trialTs[minElementIndex];
std::cout << "result = " << result;
return 0;
}
It looks like you need to expand your trialsTs,trialsRs,errorVect,freqI and resF vectors to 4608 elements. This will allow you to vectorize the loops. Derive a class from thrust::iterator_adaptor to make a cyclic iterator to expand your freqI and resF to create repeated sequences of the data in those vectors.
After you run your functors use a reduce by key transform to create your error result with each 36 element trial.
Give that a try and if you get stuck I will provide some additional code.

Polymorphism and derived classes in CUDA / CUDA Thrust

This is my first question on Stack Overflow, and it's quite a long question. The tl;dr version is: How do I work with a thrust::device_vector<BaseClass> if I want it to store objects of different types DerivedClass1, DerivedClass2, etc, simultaneously?
I want to take advantage of polymorphism with CUDA Thrust. I'm compiling for an -arch=sm_30 GPU (GeForce GTX 670).
Let us take a look at the following problem: Suppose there are 80 families in town. 60 of them are married couples, 20 of them are single-parent households. Each family has, therefore, a different number of members. It's census time and households have to state the parents' ages and the number of children they have. Therefore, an array of Family objects is constructed by the government, namely thrust::device_vector<Family> familiesInTown(80), such that information of families familiesInTown[0] to familiesInTown[59] corresponds to married couples, the rest (familiesInTown[60] to familiesInTown[79]) being single-parent households.
Family is the base class - the number of parents in the household (1 for single parents and 2 for couples) and the number of children they have are stored here as members.
SingleParent, derived from Family, includes a new member - the single parent's age, unsigned int ageOfParent.
MarriedCouple, also derived from Family, however, introduces two new members - both parents' ages, unsigned int ageOfParent1 and unsigned int ageOfParent2.
#include <iostream>
#include <stdio.h>
#include <thrust/device_vector.h>
class Family
{
protected:
unsigned int numParents;
unsigned int numChildren;
public:
__host__ __device__ Family() {};
__host__ __device__ Family(const unsigned int& nPars, const unsigned int& nChil) : numParents(nPars), numChildren(nChil) {};
__host__ __device__ virtual ~Family() {};
__host__ __device__ unsigned int showNumOfParents() {return numParents;}
__host__ __device__ unsigned int showNumOfChildren() {return numChildren;}
};
class SingleParent : public Family
{
protected:
unsigned int ageOfParent;
public:
__host__ __device__ SingleParent() {};
__host__ __device__ SingleParent(const unsigned int& nChil, const unsigned int& age) : Family(1, nChil), ageOfParent(age) {};
__host__ __device__ unsigned int showAgeOfParent() {return ageOfParent;}
};
class MarriedCouple : public Family
{
protected:
unsigned int ageOfParent1;
unsigned int ageOfParent2;
public:
__host__ __device__ MarriedCouple() {};
__host__ __device__ MarriedCouple(const unsigned int& nChil, const unsigned int& age1, const unsigned int& age2) : Family(2, nChil), ageOfParent1(age1), ageOfParent2(age2) {};
__host__ __device__ unsigned int showAgeOfParent1() {return ageOfParent1;}
__host__ __device__ unsigned int showAgeOfParent2() {return ageOfParent2;}
};
If I were to naïvely initiate the objects in my thrust::device_vector<Family> with the following functors:
struct initSlicedCouples : public thrust::unary_function<unsigned int, MarriedCouple>
{
__device__ MarriedCouple operator()(const unsigned int& idx) const
// I use a thrust::counting_iterator to get idx
{
return MarriedCouple(idx % 3, 20 + idx, 19 + idx);
// Couple 0: Ages 20 and 19, no children
// Couple 1: Ages 21 and 20, 1 child
// Couple 2: Ages 22 and 21, 2 children
// Couple 3: Ages 23 and 22, no children
// etc
}
};
struct initSlicedSingles : public thrust::unary_function<unsigned int, SingleParent>
{
__device__ SingleParent operator()(const unsigned int& idx) const
{
return SingleParent(idx % 3, 25 + idx);
}
};
int main()
{
unsigned int Num_couples = 60;
unsigned int Num_single_parents = 20;
thrust::device_vector<Family> familiesInTown(Num_couples + Num_single_parents);
// Families [0] to [59] are couples. Families [60] to [79] are single-parent households.
thrust::transform(thrust::counting_iterator<unsigned int>(0),
thrust::counting_iterator<unsigned int>(Num_couples),
familiesInTown.begin(),
initSlicedCouples());
thrust::transform(thrust::counting_iterator<unsigned int>(Num_couples),
thrust::counting_iterator<unsigned int>(Num_couples + Num_single_parents),
familiesInTown.begin() + Num_couples,
initSlicedSingles());
return 0;
}
I would definitely be guilty of some classic object slicing...
So, I asked myself, what about a vector of pointers that may give me some sweet polymorphism? Smart pointers in C++ are a thing, and thrust iterators can do some really impressive things, so let's give it a shot, I figured. The following code compiles.
struct initCouples : public thrust::unary_function<unsigned int, MarriedCouple*>
{
__device__ MarriedCouple* operator()(const unsigned int& idx) const
{
return new MarriedCouple(idx % 3, 20 + idx, 19 + idx); // Memory issues?
}
};
struct initSingles : public thrust::unary_function<unsigned int, SingleParent*>
{
__device__ SingleParent* operator()(const unsigned int& idx) const
{
return new SingleParent(idx % 3, 25 + idx);
}
};
int main()
{
unsigned int Num_couples = 60;
unsigned int Num_single_parents = 20;
thrust::device_vector<Family*> familiesInTown(Num_couples + Num_single_parents);
// Families [0] to [59] are couples. Families [60] to [79] are single-parent households.
thrust::transform(thrust::counting_iterator<unsigned int>(0),
thrust::counting_iterator<unsigned int>(Num_couples),
familiesInTown.begin(),
initCouples());
thrust::transform(thrust::counting_iterator<unsigned int>(Num_couples),
thrust::counting_iterator<unsigned int>(Num_couples + Num_single_parents),
familiesInTown.begin() + Num_couples,
initSingles());
Family A = *(familiesInTown[2]); // Compiles, but object slicing takes place (in theory)
std::cout << A.showNumOfParents() << "\n"; // Segmentation fault
return 0;
}
Seems like I've hit a wall here. Am I understanding memory management correctly? (VTables, etc). Are my objects being instantiated and populated on the device? Am I leaking memory like there is no tomorrow?
For what it's worth, in order to avoid object slicing, I tried with a dynamic_cast<DerivedPointer*>(basePointer). That's why I made my Family destructor virtual.
Family *pA = familiesInTown[2];
MarriedCouple *pB = dynamic_cast<MarriedCouple*>(pA);
The following lines compile, but, unfortunately, a segfault is thrown again. CUDA-Memcheck won't tell me why.
std::cout << "Ages " << (pB -> showAgeOfParent1()) << ", " << (pB -> showAgeOfParent2()) << "\n";
and
MarriedCouple B = *pB;
std::cout << "Ages " << B.showAgeOfParent1() << ", " << B.showAgeOfParent2() << "\n";
In short, what I need is a class interface for objects that will have different properties, with different numbers of members among each other, but that I can store in one common vector (that's why I want a base class) that I can manipulate on the GPU. My intention is to work with them both in thrust transformations and in CUDA kernels via thrust::raw_pointer_casting, which has worked flawlessly for me until I've needed to branch out my classes into a base one and several derived ones. What is the standard procedure for that?
Thanks in advance!
I am not going to attempt to answer everything in this question, it is just too large. Having said that here are some observations about the code you posted which might help:
The GPU side new operator allocates memory from a private runtime heap. As of CUDA 6, that memory cannot be accessed by the host side CUDA APIs. You can access the memory from within kernels and device functions, but that memory cannot be accessed by the host. So using new inside a thrust device functor is a broken design that can never work. That is why your "vector of pointers" model fails.
Thrust is fundamentally intended to allow data parallel versions of typical STL algorithms to be applied to POD types. Building a codebase using complex polymorphic objects and trying to cram those through Thrust containers and algorithms might be made to work, but it isn't what Thrust was designed for, and I wouldn't recommend it. Don't be surprised if you break thrust in unexpected ways if you do.
CUDA supports a lot of C++ features, but the compilation and object models are much simpler than even the C++98 standard upon which they are based. CUDA lacks several key features (RTTI for example) which make complex polymorphic object designs workable in C++. My suggestion is use C++ features sparingly. Just because you can do something in CUDA doesn't mean you should. The GPU is a simple architecture and simple data structures and code are almost always more performant than functionally similar complex objects.
Having skim read the code you posted, my overall recommendation is to go back to the drawing board. If you want to look at some very elegant CUDA/C++ designs, spend some time reading the code bases of CUB and CUSP. They are both very different, but there is a lot to learn from both (and CUSP is built on top of Thrust, which makes it even more relevant to your usage case, I suspect).
I completely agree with #talonmies answer. (e.g. I don't know that thrust has been extensively tested with polymorphism.) Furthermore, I have not fully parsed your code. I post this answer to add additional info, in particular that I believe some level of polymorphism can be made to work with thrust.
A key observation I would make is that it is not allowed to pass as an argument to a __global__ function an object of a class with virtual functions. This means that polymorphic objects created on the host cannot be passed to the device (via thrust, or in ordinary CUDA C++). (One basis for this limitation is the requirement for virtual function tables in the objects, which will necessarily be different between host and device, coupled with the fact that it is illegal to directly take the address of a device function in host code).
However, polymorphism can work in device code, including thrust device functions.
The following example demonstrates this idea, restricting ourselves to objects created on the device although we can certainly initialize them with host data. I have created two classes, Triangle and Rectangle, derived from a base class Polygon which includes a virtual function area. Triangle and Rectangle inherit the function set_values from the base class but replace the virtual area function.
We can then manipulate objects of those classes polymorphically as demonstrated here:
#include <iostream>
#include <thrust/device_vector.h>
#include <thrust/for_each.h>
#include <thrust/sequence.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/copy.h>
#define N 4
class Polygon {
protected:
int width, height;
public:
__host__ __device__ void set_values (int a, int b)
{ width=a; height=b; }
__host__ __device__ virtual int area ()
{ return 0; }
};
class Rectangle: public Polygon {
public:
__host__ __device__ int area ()
{ return width * height; }
};
class Triangle: public Polygon {
public:
__host__ __device__ int area ()
{ return (width * height / 2); }
};
struct init_f {
template <typename Tuple>
__host__ __device__ void operator()(const Tuple &arg) {
(thrust::get<0>(arg)).set_values(thrust::get<1>(arg), thrust::get<2>(arg));}
};
struct setup_f {
template <typename Tuple>
__host__ __device__ void operator()(const Tuple &arg) {
if (thrust::get<0>(arg) == 0)
thrust::get<1>(arg) = &(thrust::get<2>(arg));
else
thrust::get<1>(arg) = &(thrust::get<3>(arg));}
};
struct area_f {
template <typename Tuple>
__host__ __device__ void operator()(const Tuple &arg) {
thrust::get<1>(arg) = (thrust::get<0>(arg))->area();}
};
int main () {
thrust::device_vector<int> widths(N);
thrust::device_vector<int> heights(N);
thrust::sequence( widths.begin(), widths.end(), 2);
thrust::sequence(heights.begin(), heights.end(), 3);
thrust::device_vector<Rectangle> rects(N);
thrust::device_vector<Triangle> trgls(N);
thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(rects.begin(), widths.begin(), heights.begin())), thrust::make_zip_iterator(thrust::make_tuple(rects.end(), widths.end(), heights.end())), init_f());
thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(trgls.begin(), widths.begin(), heights.begin())), thrust::make_zip_iterator(thrust::make_tuple(trgls.end(), widths.end(), heights.end())), init_f());
thrust::device_vector<Polygon *> polys(N);
thrust::device_vector<int> selector(N);
for (int i = 0; i<N; i++) selector[i] = i%2;
thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(selector.begin(), polys.begin(), rects.begin(), trgls.begin())), thrust::make_zip_iterator(thrust::make_tuple(selector.end(), polys.end(), rects.end(), trgls.end())), setup_f());
thrust::device_vector<int> areas(N);
thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(polys.begin(), areas.begin())), thrust::make_zip_iterator(thrust::make_tuple(polys.end(), areas.end())), area_f());
thrust::copy(areas.begin(), areas.end(), std::ostream_iterator<int>(std::cout, "\n"));
return 0;
}
I suggest compiling the above code for a cc2.0 or newer architecture. I tested with CUDA 6 on RHEL 5.5.
(The polymorphic example idea, and some of the code, was taken from here.)

how to get the index of thrust foreach

I am trying to using thrust for each to give device vector certain values
here is the code
const uint N = 222222;
struct assign_functor
{
template <typename Tuple>
__device__
void operator()(Tuple t)
{
uint x = threadIdx.x + blockIdx.x * blockDim.x;
uint y = threadIdx.y + blockIdx.y * blockDim.y;
uint offset = x + y * blockDim.x * gridDim.x;
thrust::get<0>(t) = offset;
}
};
int main(int argc, char** argv)
{
thrust::device_vector <float> d_float_vec(N);
thrust::for_each(
thrust::make_zip_iterator(
thrust::make_tuple(d_float_vec.begin())
),
thrust::make_zip_iterator(
thrust::make_tuple(d_float_vec.end())
),
assign_functor()
);
std::cout<<d_float_vec[10]<<" "<<d_float_vec[N-2]
}
the output of d_float_vec[N-2] is supposed to be 222220; but it turns out 1036. whats wrong with my code??
I know I could use thrust::sequence to give a sequence values to the vector. I just want to know how to get the real index for thrust foreach function. Thanks!
As noted in comments, your approach is never likely to work because you have assumed a number of things about the way thrust::for_each works internally which are probably not true, including:
You implicitly are assuming that for_each uses a single thread to process each input element. This is almost certainly not the case; it is much more likely that thrust will process multiple elements per thread during the operation.
You are also assuming that execution happens in order so that the Nth thread processes the Nth array element. That may not be the case, and execution may occur in an order which cannot be known a priori
You are assuming for_each processes the whole input data set in a single kernel laumch
Thrust algorithms should be treated as black boxes whose internal operations are undefined and no knowledge of them is required to implement user defined functors. In your example, if you require a sequential index inside a functor, pass a counting iterator. One way to re-write your example would be like this:
#include "thrust/device_vector.h"
#include "thrust/for_each.h"
#include "thrust/tuple.h"
#include "thrust/iterator/counting_iterator.h"
typedef unsigned int uint;
const uint N = 222222;
struct assign_functor
{
template <typename Tuple>
__device__
void operator()(Tuple t)
{
thrust::get<1>(t) = (float)thrust::get<0>(t);
}
};
int main(int argc, char** argv)
{
thrust::device_vector <float> d_float_vec(N);
thrust::counting_iterator<uint> first(0);
thrust::counting_iterator<uint> last = first + N;
thrust::for_each(
thrust::make_zip_iterator(
thrust::make_tuple(first, d_float_vec.begin())
),
thrust::make_zip_iterator(
thrust::make_tuple(last, d_float_vec.end())
),
assign_functor()
);
std::cout<<d_float_vec[10]<<" "<<d_float_vec[N-2]<<std::endl;
}
Here the counting iterator gets passed in a tuple along with the data array, allow the functor access to a sequential index which corresponds to the data array entry it is dealing with.

How to gather rows from a matrix by indices list using CUDA Thrust

This is seemingly a simple problem but I just can’t figure out an elegant way to do this with CUDA Thrust.
I have a two dimensional matrix NxM and a vector of desired row indices of size L that is a subset of all rows(i.e. L < N) and is not regular (basically an irregular list like, 7,11,13,205,... etc.). The matrix is stored by rows in a thrust device vector. The array of indices is a device vector as well.
Here are my two questions:
What is the most efficient way to copy the desired rows from the original NxM matrix forming a new matrix LxM?
Is it possible to create an iterator for the original NxM matrix that would dereference to only elements that belong to the desired rows?
Thank you very much for your help.
What you are asking about seems like a pretty straight forward stream compaction problem, and there isn't any particular problem doing it with thrust, but there are a couple of twists. In order to select the rows to copy, you need to have an stencil or key that the stream compaction algorithm can use. That needs to be constructed by a search or select operation using your list of rows to copy.
One example procedure to do this would go something like this:
Construct an iterator which returns the row number of any entry in the input matrix. Thrust has a very useful counting_iterator and transform_iterator which can be combined to do this
Perform a search of that row number iterator to find which entries match the list of rows to copy. thrust::binary search can be used for this. The search yields the stencil for the stream compaction operation
Use thrust::copy_if to perform the stream compaction on the input matrix with the stencil.
It sounds like a lot of work and intermediate steps, but the counting and transformation iterators don't actually produce any intermediate device vectors. The only intermediate storage required is the stencil array, which can be a boolean (so m*n bytes).
A full example in code:
#include <thrust/copy.h>
#include <thrust/binary_search.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/device_vector.h>
#include <cstdio>
struct div_functor : public thrust::unary_function<int,int>
{
int m;
div_functor(int _m) : m(_m) {};
__host__ __device__
int operator()(int x) const
{
return x / m;
}
};
struct is_true
{
__host__ __device__
bool operator()(bool x) { return x; }
};
int main(void)
{
// dimensions of the problem
const int m=20, n=5, l=4;
// Counting iterator for generating sequential indices
// Sample matrix containing 0...(m*n)
thrust::counting_iterator<float> indices(0.f);
thrust::device_vector<float> in_matrix(m*n);
thrust::copy(indices, indices+(m*n), in_matrix.begin());
// device vector contain rows to select
thrust::device_vector<int> select(l);
select[0] = 1;
select[1] = 4;
select[2] = 9;
select[3] = 16;
// construct device iterator supplying row numbers via a functor
typedef thrust::counting_iterator<int> counter;
typedef thrust::transform_iterator<div_functor, counter> rowIterator;
rowIterator rows_begin = thrust::make_transform_iterator(thrust::make_counting_iterator(0), div_functor(n));
rowIterator rows_end = rows_begin + (m*n);
// constructor a stencil array which indicates which entries will be copied
thrust::device_vector<bool> docopy(m*n);
thrust::binary_search(select.begin(), select.end(), rows_begin, rows_end, docopy.begin());
// use stream compaction on the matrix with the stencil array
thrust::device_vector<float> out_matrix(l*n);
thrust::copy_if(in_matrix.begin(), in_matrix.end(), docopy.begin(), out_matrix.begin(), is_true());
for(int i=0; i<(l*n); i++) {
float val = out_matrix[i];
printf("%i %f\n", i, val);
}
}
(usual disclaimer: use at your own risk)
About the only comment I would make is that the predicate to the copy_if call feels a bit redundant given we have already a binary stencil that could be used directly, but there doesn't seem to be a variant of the compaction algorithms which can operate on a binary stencil directly. Similarly, I could not think of a sensible way to use the list of rows directly in the stream compaction call. There might well be a more efficient way to do this with thrust, but this should at least get you started.
From your comment, it seems that space is tight and the additional memory overhead of the binary search and stencil creation is prohibitive for your application. In that case I would follow the advice I offered in a comment to Roger Dahl's answer, and use a custom copy kernel instead. Thrust device vectors can be cast to a pointer you can pass directly to a kernel (thrust::raw_pointer_cast), so it need not interfere with your existing thrust code. I would suggest using a block of threads per row to copy, that allows coalescing of reads and writes and should perform a lot better than using thrust::copy for each row. A very simple implementation might look something like this (reusing most of my thrust example):
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/device_vector.h>
#include <cstdio>
__global__
void rowcopykernel(const float *in, float *out, const int *list, const int m, const int n, const int l)
{
__shared__ const float * inrowp;
__shared__ float * outrowp;
if (threadIdx.x == 0) {
inrowp = (blockIdx.x < l) ? in + (n*list[blockIdx.x]) : 0;
outrowp = out + (n*blockIdx.x);
}
__syncthreads();
for(int i=threadIdx.x; (inrowp != 0) && (i<n); i+=blockDim.x) {
*(outrowp+i) = *(inrowp+i);
}
}
int main(void)
{
// dimensions of the problem
const int m=20, n=5, l=4;
// Sample matrix containing 0...(m*n)
thrust::counting_iterator<float> indices(0.f);
thrust::device_vector<float> in_matrix(m*n);
thrust::copy(indices, indices+(m*n), in_matrix.begin());
// device vector contain rows to select
thrust::device_vector<int> select(l);
select[0] = 1;
select[1] = 4;
select[2] = 9;
select[3] = 16;
// Output matrix
thrust::device_vector<float> out_matrix(l*n);
// raw pointer to thrust vectors
int * selp = thrust::raw_pointer_cast(&select[0]);
float * inp = thrust::raw_pointer_cast(&in_matrix[0]);
float * outp = thrust::raw_pointer_cast(&out_matrix[0]);
dim3 blockdim = dim3(128);
dim3 griddim = dim3(l);
rowcopykernel<<<griddim,blockdim>>>(inp, outp, selp, m, n, l);
for(int i=0; i<(l*n); i++) {
float val = out_matrix[i];
printf("%i %f\n", i, val);
}
}
(standard disclaimer: use at your own risk).
The execution parameter selection could be made fancier, but otherwise that should be about all that is required. If your rows are very small, you might want to investigate using a warp per row rather than a block (so one block copies several rows). If you have more than 65535 output rows, then you will need to either use a 2D grid, or modify the code to have each block do multiple rows. But, as with the thrust based solution about, this should get you started.
if you are not fixed on thrust, check out Arrafire:
surprisingly unlike thrust, this library has a native support for subscript indexing,
so that your problem can be solved in just few lines of code:
const int N = 7, M = 5;
float L_host[] = {3, 6, 4, 1};
int szL = sizeof(L_host) / sizeof(float);
// generate random NxM matrix with cuComplex data
array A = randu(N, M, c32);
// array used to index rows
array L(szL, 1, L_host);
print(A);
print(L);
array B = A(L,span); // copy selected rows of A
print(B);
and the results:
A =
0.7402 + 0.9210i 0.6814 + 0.2920i 0.5786 + 0.5538i 0.2133 + 0.4131i 0.7305 + 0.9400i
0.0390 + 0.9690i 0.3194 + 0.8109i 0.3557 + 0.7229i 0.0328 + 0.5360i 0.8432 + 0.6116i
0.9251 + 0.4464i 0.1541 + 0.4452i 0.2783 + 0.6192i 0.7214 + 0.3546i 0.2674 + 0.0208i
0.6673 + 0.1099i 0.2080 + 0.6110i 0.5876 + 0.3750i 0.2527 + 0.9847i 0.8331 + 0.7218i
0.4702 + 0.5132i 0.3073 + 0.4156i 0.2405 + 0.4148i 0.9200 + 0.1872i 0.6087 + 0.6301i
0.7762 + 0.2948i 0.2343 + 0.8793i 0.0937 + 0.6326i 0.1820 + 0.5984i 0.5298 + 0.8127i
0.7140 + 0.3585i 0.6462 + 0.9264i 0.2849 + 0.7793i 0.7082 + 0.0421i 0.0593 + 0.4797i
L = (row indices)
3.0000
6.0000
4.0000
1.0000
B =
0.6673 + 0.1099i 0.2080 + 0.6110i 0.5876 + 0.3750i 0.2527 + 0.9847i 0.8331 + 0.7218i
0.7140 + 0.3585i 0.6462 + 0.9264i 0.2849 + 0.7793i 0.7082 + 0.0421i 0.0593 + 0.4797i
0.4702 + 0.5132i 0.3073 + 0.4156i 0.2405 + 0.4148i 0.9200 + 0.1872i 0.6087 + 0.6301i
0.0390 + 0.9690i 0.3194 + 0.8109i 0.3557 + 0.7229i 0.0328 + 0.5360i 0.8432 + 0.6116i
it also works pretty fast. I tested this with an array of cuComplex of size
2000 x 2000 using the following code:
float *g_data = 0, *g_data2 = 0;
int g_N = 2000, g_M = 2000, // matrix of size g_N x g_M
g_L = 400; // copy g_L rows
void af_test()
{
array A(g_N, g_M, (cuComplex *)g_data, afDevicePointer);
array L(g_L, 1, g_data2, afDevicePointer);
array B = (A(L, span));
std::cout << "sz: " << B.elements() << "\n";
}
int main()
{
// input matrix N x M of cuComplex
array in = randu(g_N, g_M, c32);
g_data = (float *)in.device< cuComplex >();
// generate unique row indices
array in2 = setunique(floor(randu(g_L) * g_N));
print(in2);
g_data2 = in2.device<float>();
const int N_ITERS = 30;
try {
info();
af::sync();
timer::tic();
for(int i = 0; i < N_ITERS; i++) {
af_test();
}
af::sync();
printf("af: %.5f seconds\n", timer::toc() / N_ITERS);
} catch (af::exception& e) {
fprintf(stderr, "%s\n", e.what());
}
in.unlock();
in2.unlock();
}
I don't think there is a way to do this with Thrust but, because the operation will be memory bound, it should be easy to write a kernel that performs this operation at maximum possible performance. Simply create the same number of threads as there are indices in the vector. Have each thread calculate the source and destination addresses for one row and then use memcpy() to copy the row.
You may also want to carefully consider if it is possible to set up subsequent processing steps to access the rows in place, thereby avoiding the entire, expensive "compacting" operation, that only shuffles memory around. Even if addressing the rows becomes slightly more complicated (an extra memory lookup and multiply, maybe), overall performance may be much better.