How to gather rows from a matrix by an indices list using CUDA Thrust

This is seemingly a simple problem but I just can't figure out an elegant way to do it with CUDA Thrust.
I have a two-dimensional matrix NxM and a vector of desired row indices of size L that is a subset of all rows (i.e. L < N) and is not regular (basically an irregular list like 7, 11, 13, 205, ... etc.). The matrix is stored by rows in a Thrust device vector. The array of indices is a device vector as well.
Here are my two questions:
What is the most efficient way to copy the desired rows from the original NxM matrix forming a new matrix LxM?
Is it possible to create an iterator for the original NxM matrix that would dereference only to elements belonging to the desired rows?
Thank you very much for your help.

What you are asking about seems like a pretty straightforward stream compaction problem, and there isn't any particular problem doing it with thrust, but there are a couple of twists. In order to select the rows to copy, you need a stencil or key that the stream compaction algorithm can use. That needs to be constructed by a search or select operation using your list of rows to copy.
One example procedure to do this would go something like this:
Construct an iterator which returns the row number of any entry in the input matrix. Thrust has a very useful counting_iterator and transform_iterator which can be combined to do this.
Perform a search of that row number iterator to find which entries match the list of rows to copy. thrust::binary_search can be used for this. The search yields the stencil for the stream compaction operation.
Use thrust::copy_if to perform the stream compaction on the input matrix with the stencil.
It sounds like a lot of work and intermediate steps, but the counting and transformation iterators don't actually produce any intermediate device vectors. The only intermediate storage required is the stencil array, which can be a boolean (so m*n bytes).
A full example in code:
#include <thrust/copy.h>
#include <thrust/binary_search.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/device_vector.h>
#include <cstdio>

struct div_functor : public thrust::unary_function<int,int>
{
    int m;
    div_functor(int _m) : m(_m) {};

    __host__ __device__
    int operator()(int x) const
    {
        return x / m;
    }
};

struct is_true
{
    __host__ __device__
    bool operator()(bool x) { return x; }
};

int main(void)
{
    // dimensions of the problem
    const int m=20, n=5, l=4;

    // Counting iterator for generating sequential indices
    // Sample matrix containing 0...(m*n)
    thrust::counting_iterator<float> indices(0.f);
    thrust::device_vector<float> in_matrix(m*n);
    thrust::copy(indices, indices+(m*n), in_matrix.begin());

    // device vector containing the rows to select
    thrust::device_vector<int> select(l);
    select[0] = 1;
    select[1] = 4;
    select[2] = 9;
    select[3] = 16;

    // construct device iterator supplying row numbers via a functor
    typedef thrust::counting_iterator<int> counter;
    typedef thrust::transform_iterator<div_functor, counter> rowIterator;
    rowIterator rows_begin = thrust::make_transform_iterator(thrust::make_counting_iterator(0), div_functor(n));
    rowIterator rows_end = rows_begin + (m*n);

    // construct a stencil array which indicates which entries will be copied
    thrust::device_vector<bool> docopy(m*n);
    thrust::binary_search(select.begin(), select.end(), rows_begin, rows_end, docopy.begin());

    // use stream compaction on the matrix with the stencil array
    thrust::device_vector<float> out_matrix(l*n);
    thrust::copy_if(in_matrix.begin(), in_matrix.end(), docopy.begin(), out_matrix.begin(), is_true());

    for(int i=0; i<(l*n); i++) {
        float val = out_matrix[i];
        printf("%i %f\n", i, val);
    }
}
(usual disclaimer: use at your own risk)
About the only comment I would make is that the predicate to the copy_if call feels a bit redundant given that we already have a binary stencil that could be used directly, but there doesn't seem to be a variant of the compaction algorithms which can operate on a binary stencil directly. Similarly, I could not think of a sensible way to use the list of rows directly in the stream compaction call. There might well be a more efficient way to do this with thrust, but this should at least get you started.
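On reflection, here is one hedged sketch of using the row list directly: build a gather map from each output position back to its source position, and hand that to thrust::gather (this reuses n, l, select, in_matrix and out_matrix from the example above, and is untested):
#include <thrust/gather.h>

// maps output position i to its source position in the NxM matrix
struct gather_index : public thrust::unary_function<int,int>
{
    const int *select;
    int n;
    gather_index(const int *s, int _n) : select(s), n(_n) {};
    __host__ __device__
    int operator()(int i) const { return select[i / n] * n + (i % n); }
};

// usage, with the containers from the example above:
// thrust::counting_iterator<int> first(0);
// gather_index map(thrust::raw_pointer_cast(select.data()), n);
// thrust::gather(thrust::make_transform_iterator(first, map),
//                thrust::make_transform_iterator(first + (l*n), map),
//                in_matrix.begin(), out_matrix.begin());
This would remove both the stencil array and the binary search, at the cost of one extra indirect load per copied element.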
From your comment, it seems that space is tight and the additional memory overhead of the binary search and stencil creation is prohibitive for your application. In that case I would follow the advice I offered in a comment to Roger Dahl's answer, and use a custom copy kernel instead. Thrust device vectors can be cast to a pointer you can pass directly to a kernel (thrust::raw_pointer_cast), so it need not interfere with your existing thrust code. I would suggest using a block of threads per row to copy, that allows coalescing of reads and writes and should perform a lot better than using thrust::copy for each row. A very simple implementation might look something like this (reusing most of my thrust example):
#include <thrust/copy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/device_vector.h>
#include <cstdio>

__global__
void rowcopykernel(const float *in, float *out, const int *list, const int m, const int n, const int l)
{
    __shared__ const float * inrowp;
    __shared__ float * outrowp;

    // one thread per block fetches the source and destination row pointers
    if (threadIdx.x == 0) {
        inrowp = (blockIdx.x < l) ? in + (n*list[blockIdx.x]) : 0;
        outrowp = out + (n*blockIdx.x);
    }
    __syncthreads();

    // the whole block copies the row, one element per thread per iteration
    for(int i=threadIdx.x; (inrowp != 0) && (i<n); i+=blockDim.x) {
        *(outrowp+i) = *(inrowp+i);
    }
}

int main(void)
{
    // dimensions of the problem
    const int m=20, n=5, l=4;

    // Sample matrix containing 0...(m*n)
    thrust::counting_iterator<float> indices(0.f);
    thrust::device_vector<float> in_matrix(m*n);
    thrust::copy(indices, indices+(m*n), in_matrix.begin());

    // device vector containing the rows to select
    thrust::device_vector<int> select(l);
    select[0] = 1;
    select[1] = 4;
    select[2] = 9;
    select[3] = 16;

    // Output matrix
    thrust::device_vector<float> out_matrix(l*n);

    // raw pointers to thrust vectors
    int * selp = thrust::raw_pointer_cast(&select[0]);
    float * inp = thrust::raw_pointer_cast(&in_matrix[0]);
    float * outp = thrust::raw_pointer_cast(&out_matrix[0]);

    // one block of threads per row to copy
    dim3 blockdim = dim3(128);
    dim3 griddim = dim3(l);
    rowcopykernel<<<griddim, blockdim>>>(inp, outp, selp, m, n, l);

    for(int i=0; i<(l*n); i++) {
        float val = out_matrix[i];
        printf("%i %f\n", i, val);
    }
}
(standard disclaimer: use at your own risk).
The execution parameter selection could be made fancier, but otherwise that should be about all that is required. If your rows are very small, you might want to investigate using a warp per row rather than a block (so one block copies several rows), as in the sketch below. If you have more than 65535 output rows, then you will need to either use a 2D grid, or modify the code to have each block do multiple rows. But, as with the thrust based solution above, this should get you started.
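A warp-per-row variant might look something like this (an untested sketch; each 32-thread warp copies one row, so a 128-thread block handles four rows):
// warp-per-row variant of rowcopykernel (hypothetical name)
__global__
void rowcopykernel_warp(const float *in, float *out, const int *list,
                        const int n, const int l)
{
    // global warp index and lane within the warp
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int lane = threadIdx.x % warpSize;

    if (warp < l) {
        const float *inrow  = in  + (n * list[warp]);
        float       *outrow = out + (n * warp);
        for (int i = lane; i < n; i += warpSize)
            outrow[i] = inrow[i];
    }
}

// launch with enough warps to cover all l rows, e.g.:
// rowcopykernel_warp<<<(l*32 + 127)/128, 128>>>(inp, outp, selp, n, l);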

if you are not fixed on thrust, check out ArrayFire:
surprisingly, unlike thrust, this library has native support for subscript indexing,
so that your problem can be solved in just a few lines of code:
const int N = 7, M = 5;
float L_host[] = {3, 6, 4, 1};
int szL = sizeof(L_host) / sizeof(float);
// generate random NxM matrix with cuComplex data
array A = randu(N, M, c32);
// array used to index rows
array L(szL, 1, L_host);
print(A);
print(L);
array B = A(L,span); // copy selected rows of A
print(B);
and the results:
A =
0.7402 + 0.9210i 0.6814 + 0.2920i 0.5786 + 0.5538i 0.2133 + 0.4131i 0.7305 + 0.9400i
0.0390 + 0.9690i 0.3194 + 0.8109i 0.3557 + 0.7229i 0.0328 + 0.5360i 0.8432 + 0.6116i
0.9251 + 0.4464i 0.1541 + 0.4452i 0.2783 + 0.6192i 0.7214 + 0.3546i 0.2674 + 0.0208i
0.6673 + 0.1099i 0.2080 + 0.6110i 0.5876 + 0.3750i 0.2527 + 0.9847i 0.8331 + 0.7218i
0.4702 + 0.5132i 0.3073 + 0.4156i 0.2405 + 0.4148i 0.9200 + 0.1872i 0.6087 + 0.6301i
0.7762 + 0.2948i 0.2343 + 0.8793i 0.0937 + 0.6326i 0.1820 + 0.5984i 0.5298 + 0.8127i
0.7140 + 0.3585i 0.6462 + 0.9264i 0.2849 + 0.7793i 0.7082 + 0.0421i 0.0593 + 0.4797i
L = (row indices)
3.0000
6.0000
4.0000
1.0000
B =
0.6673 + 0.1099i 0.2080 + 0.6110i 0.5876 + 0.3750i 0.2527 + 0.9847i 0.8331 + 0.7218i
0.7140 + 0.3585i 0.6462 + 0.9264i 0.2849 + 0.7793i 0.7082 + 0.0421i 0.0593 + 0.4797i
0.4702 + 0.5132i 0.3073 + 0.4156i 0.2405 + 0.4148i 0.9200 + 0.1872i 0.6087 + 0.6301i
0.0390 + 0.9690i 0.3194 + 0.8109i 0.3557 + 0.7229i 0.0328 + 0.5360i 0.8432 + 0.6116i
it also works pretty fast. I tested this with an array of cuComplex of size
2000 x 2000 using the following code:
float *g_data = 0, *g_data2 = 0;
int g_N = 2000, g_M = 2000, // matrix of size g_N x g_M
    g_L = 400;              // copy g_L rows

void af_test()
{
    array A(g_N, g_M, (cuComplex *)g_data, afDevicePointer);
    array L(g_L, 1, g_data2, afDevicePointer);
    array B = (A(L, span));
    std::cout << "sz: " << B.elements() << "\n";
}

int main()
{
    // input matrix N x M of cuComplex
    array in = randu(g_N, g_M, c32);
    g_data = (float *)in.device< cuComplex >();

    // generate unique row indices
    array in2 = setunique(floor(randu(g_L) * g_N));
    print(in2);
    g_data2 = in2.device<float>();

    const int N_ITERS = 30;
    try {
        info();
        af::sync();
        timer::tic();
        for(int i = 0; i < N_ITERS; i++) {
            af_test();
        }
        af::sync();
        printf("af: %.5f seconds\n", timer::toc() / N_ITERS);
    } catch (af::exception& e) {
        fprintf(stderr, "%s\n", e.what());
    }
    in.unlock();
    in2.unlock();
}

I don't think there is a way to do this with Thrust but, because the operation will be memory bound, it should be easy to write a kernel that performs this operation at maximum possible performance. Simply create the same number of threads as there are indices in the vector. Have each thread calculate the source and destination addresses for one row and then use memcpy() to copy the row.
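A minimal sketch of that one-thread-per-row kernel (untested, assuming float data and hypothetical names; n is the row width, l the number of rows to copy):
// one thread per selected row; each thread memcpy()s its whole row
__global__
void gather_rows(const float *in, float *out, const int *rows,
                 const int n, const int l)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < l)
        memcpy(out + (size_t)tid * n,
               in + (size_t)rows[tid] * n,
               n * sizeof(float));
}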
You may also want to carefully consider whether it is possible to set up subsequent processing steps to access the rows in place, thereby avoiding the entire, expensive "compacting" operation, which only shuffles memory around. Even if addressing the rows becomes slightly more complicated (an extra memory lookup and multiply, maybe), overall performance may be much better.
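As a sketch of what that in-place addressing could look like with Thrust's permutation_iterator (untested, reusing the n, select and in_matrix names from the first answer), the selected rows can be presented to any algorithm as a contiguous range without ever being copied:
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>

// maps position i of the virtual LxM matrix to its position in the NxM source
struct row_map : public thrust::unary_function<int,int>
{
    const int *rows;
    int n;
    row_map(const int *r, int _n) : rows(r), n(_n) {};
    __host__ __device__
    int operator()(int i) const { return rows[i / n] * n + (i % n); }
};

// usage:
// typedef thrust::counting_iterator<int> counter;
// typedef thrust::transform_iterator<row_map, counter> map_iter;
// map_iter map = thrust::make_transform_iterator(counter(0),
//     row_map(thrust::raw_pointer_cast(select.data()), n));
// thrust::make_permutation_iterator(in_matrix.begin(), map) now iterates
// over an l*n element view of just the selected rows, in place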

Related

PyCUDA illegal memory access of curandState*

I'm studying the spread of an invasive species and am trying to generate random numbers within a PyCUDA kernel using the XORWOW random number generator. The matrices I need to be able to use as input in the study are quite large (up to 8,000 x 8,000).
The error seems to occur inside get_random_number when indexing the curandState* of the XORWOW generator. The code executes without errors on smaller matrices and produces correct results. I'm running my code on 2 NVidia Tesla K20X GPUs.
Kernel code and setup:
kernel_code = '''
#include <curand_kernel.h>
#include <math.h>

extern "C" {

__device__ float get_random_number(curandState* global_state, int thread_id) {
    curandState local_state = global_state[thread_id];
    float num = curand_uniform(&local_state);
    global_state[thread_id] = local_state;
    return num;
}

__global__ void survival_of_the_fittest(float* grid_a, float* grid_b, curandState* global_state, int grid_size, float* survival_probabilities) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;  // column index of cell
    int y = threadIdx.y + blockIdx.y * blockDim.y;  // row index of cell

    // make sure this cell is within bounds of grid
    if (x < grid_size && y < grid_size) {
        int thread_id = y * grid_size + x;  // thread index
        grid_b[thread_id] = grid_a[thread_id];  // copy current cell
        float num;

        // ignore cell if it is not already populated
        if (grid_a[thread_id] > 0.0) {
            num = get_random_number(global_state, thread_id);

            // agents in this cell die
            if (num < survival_probabilities[thread_id]) {
                grid_b[thread_id] = 0.0;  // cell dies
                //printf("Cell (%d,%d) died (probability of death was %f)\\n", x, y, survival_probabilities[thread_id]);
            }
        }
    }
}

}  // extern "C"
'''
mod = SourceModule(kernel_code, no_extern_c = True)
survival = mod.get_function('survival_of_the_fittest')
Data setup:
matrix_size = 2000
block_dims = 32
grid_dims = (matrix_size + block_dims - 1) // block_dims
grid_a = gpuarray.to_gpu(np.ones((matrix_size,matrix_size)).astype(np.float32))
grid_b = gpuarray.to_gpu(np.zeros((matrix_size,matrix_size)).astype(np.float32))
generator = curandom.XORWOWRandomNumberGenerator()
grid_size = np.int32(matrix_size)
survival_probabilities = gpuarray.to_gpu(np.random.uniform(0,1,(matrix_size,matrix_size)))
Kernel call:
survival(grid_a, grid_b, generator.state, grid_size, survival_probabilities,
grid = (grid_dims, grid_dims), block = (block_dims, block_dims, 1))
I expect to be able to generate random numbers within the range (0,1] for matrices up to (8,000 x 8,000), but executing my code on large matrices leads to an illegal memory access error.
pycuda._driver.LogicError: cuMemcpyDtoH failed: an illegal memory access was encountered
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuMemFree failed: an illegal memory access was encountered
Am I indexing the curandState* incorrectly in get_random_number? And if not, what else might be causing this error?
The problem here is a disconnect between the code in the PyCUDA curandom interface, which determines the size of the internal state the generator allocates, and this code in your post:
matrix_size = 2000
block_dims = 32
grid_dims = (matrix_size + block_dims - 1) // block_dims
You seem to be assuming that PyCUDA will magically allocate enough state for whatever block and grid dimensions you select in your code. That is obviously unlikely, particularly at large grid sizes. You either need to
Modify your code to use the same block and grid sizes as the curandom module uses internally for whichever generator you choose to use, or
Allocate and manage your own state scratch space so that you have enough state allocated to service the block and grid sizes you select
I leave it as an exercise to the reader as to which one of these two approaches will work better in your application.
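In CUDA C terms, the second option boils down to something like this sketch (names are illustrative, not code from the question; the same allocation and launch can be driven from PyCUDA):
#include <curand_kernel.h>

// seed one curandState per thread actually launched
__global__ void setup_states(curandState *states, unsigned long long seed, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < n)
        curand_init(seed, id, 0, &states[id]);  // unique subsequence per thread
}

// host side, for a grid_size x grid_size simulation:
// curandState *states;
// cudaMalloc(&states, (size_t)grid_size * grid_size * sizeof(curandState));
// setup_states<<<(grid_size*grid_size + 255)/256, 256>>>(states, 1234ULL,
//                                                        grid_size*grid_size);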

Sum a variable over all threads in a CUDA Kernel and return it to Host

I'm new to CUDA and I'm trying to implement a kernel to calculate the energy in my Metropolis Monte Carlo simulation.
I'll put here the serial version of this function:
float calc_energy(struct frame frm, float L, float rc){
    int i, j;
    float E=0, rij, dx, dy, dz;

    for(i=0; i<frm.natm; i++)
    {
        for(j=i+1; j<frm.natm; j++)
        {
            dx = fabs(frm.conf[j][0] - frm.conf[i][0]);
            dy = fabs(frm.conf[j][1] - frm.conf[i][1]);
            dz = fabs(frm.conf[j][2] - frm.conf[i][2]);

            // minimum image convention
            dx = dx - round(dx/L)*L;
            dy = dy - round(dy/L)*L;
            dz = dz - round(dz/L)*L;

            /*rij*/
            rij = sqrt(dx*dx + dy*dy + dz*dz);

            if (rij <= rc)
            {
                E = E + (4*((1/pow(rij,12))-(1/pow(rij,6))));
            }
        }
    }
    return E;
}
Then I'm trying to parallelize this using CUDA. This is my idea:
__global__ void calc_energy(frame* s, float L, float rc)
{
    extern __shared__ float E;

    int i = blockDim.x*blockIdx.x + threadIdx.x;
    int j = blockDim.y*blockIdx.y + threadIdx.y;
    float rij, dx, dy, dz;

    dx = fabs(s->conf[j][0] - s->conf[i][0]);
    dy = fabs(s->conf[j][1] - s->conf[i][1]);
    dz = fabs(s->conf[j][2] - s->conf[i][2]);

    dx = dx - round(dx/L)*L;
    dy = dy - round(dy/L)*L;
    dz = dz - round(dz/L)*L;

    rij = sqrt(dx*dx + dy*dy + dz*dz);

    if (rij <= rc)
    {
        E += (4*((1/pow(rij,12))-(1/pow(rij,6)))); //<- here is the big problem
    }
}
My main question is: how do I sum the variable E from each thread and return it to the host? I intend to use as many threads and blocks as possible.
Obviously part of the code is missing where the variable E is accumulated.
I have read a few things about reduction methods, but I would like to know whether they are necessary here.
I call the kernel using the following code:
calc_energy<<<dimGrid,dimBlock>>>(d_state, 100, 5);
Edit:
I understood that I needed to use reduction methods. CUB works great for me.
Continuing with the implementation of the code, I realized that I have a new problem, perhaps because of my lack of knowledge in this area.
In my nested loop, the variable (frm.natm) can reach values on the order of 10^5. Thinking of my GPU (GTX 750 Ti), the number of threads per block is 1024 and the number of blocks per grid is 1024. If I understood correctly, the maximum number of runs in a kernel is 1024 x 1024 = 1048576 (less than that, actually).
So if I need to do 10^5 x 10^5 = 10^10 calculations in my nested loop, what would be the best way to think about the algorithm? Would choosing a fixed number (that fits my GPU) and splitting the calculations up be a good idea?
My main question is how to sum the variable E from each thread and return it to the host?
You will need to sum each thread's calculation at a block level first, using some form of block-wise parallel reduction (I recommend the CUB block-wise reduction implementation for that).
Once each block has a partial sum from its threads, the block sums need to be combined. This can be done either atomically by one thread from each block, by a second kernel call (with one block), or on the host. How and where you will use the final sum will determine which of those options is the most optimal for your application.
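As a sketch of that two-step pattern, here is what the block-level stage could look like with CUB's block reduce, combining the block sums atomically (assuming the per-thread partial energies have been staged in a device array; partials, E_total and energy_sum are illustrative names, not code from the question):
#include <cub/cub.cuh>

// Sketch: block-wise sum with cub::BlockReduce; each block's result is
// folded into a single device scalar with atomicAdd.
template <int BLOCK>
__global__ void energy_sum(const float *partials, float *E_total, int n)
{
    typedef cub::BlockReduce<float, BLOCK> BlockReduce;
    __shared__ typename BlockReduce::TempStorage temp;

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float e = (idx < n) ? partials[idx] : 0.0f;  // this thread's partial energy

    float block_sum = BlockReduce(temp).Sum(e);  // block-level reduction

    if (threadIdx.x == 0)
        atomicAdd(E_total, block_sum);           // combine the block sums
}
Alternatively, if the partial results are staged in a device vector, the final sum collapses to a single thrust::reduce call, as in this complete example: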
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <algorithm>
#include <cstdlib>
#include <iostream>

int main(void)
{
    // generate some random data on the host
    thrust::host_vector<int> h_vec(100);
    std::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer to the device and sum with a single reduction call
    thrust::device_vector<int> d_vec = h_vec;
    int x = thrust::reduce(d_vec.begin(), d_vec.end(), 0, thrust::plus<int>());
    std::cout << x << std::endl;
    return 0;
}

How to bring equal elements together using thrust without sort

I have an array of elements such that each element defines the "equal to" operator only.
In other words, no ordering is defined for this type of element.
Since I can't use thrust::sort as in the thrust histogram example, how can I bring equal elements together using thrust?
For example:
my array is initially
a e t b c a c e t a
where identical characters represent equal elements.
After the elaboration, the array should be
a a a t t b c c e e
but it can be also
a a a c c t t e e b
or any other permutation.
I would recommend that you follow an approach such as that laid out by @m.s. in the posted answer there. As I stated in the comments, ordering of elements is an extremely useful mechanism that aids in the reduction of complexity for problems like this.
However the question as posed asks if it is possible to group like elements without sorting. With an inherently parallel processor like a GPU, I spent some time thinking about how it might be accomplished without sorting.
If we have both a large number of objects, as well as a large number of unique object types, then I think it's possible to bring some level of parallelism to the problem, however my approach outlined here will still have atrocious, scattered memory access patterns. For the case where there are only a small number of distinct or unique object types, the algorithm I am discussing here has little to commend it. This is just one possible approach. There may well be other, far better approaches:
The starting point is to develop a set of "linked lists" that indicate the matching neighbor to the left and the matching neighbor to the right, for each element. This is accomplished via my search_functor and thrust::for_each, on the entire data set. This step is reasonably parallel and also has reasonable memory access efficiency for large data sets, but it does require a worst-case traversal of the entire data set from start to finish (a side-effect, I would call it, of not being able to use ordering; we must compare every element to other elements until we find a match). The generation of two linked lists allows us to avoid all-to-all comparisons.
Once we have the lists (right-neighbor and left-neighbor) built from step 1, it's an easy matter to count the number of unique objects, using thrust::count.
We then get the starting indexes of each unique element (i.e. the leftmost index of each type of unique element, in the dataset), using thrust::copy_if stream compaction.
The next step is to count the number of instances of each of the unique elements. This step is doing list traversal, one thread per element list. If I have a small number of unique elements, this will not effectively utilize the GPU. In addition, the list traversal will result in lousy access patterns.
After we have counted the number of each type of object, we can then build a sequence of starting indices for each object type in the output list, via thrust::exclusive_scan on the numbers of each type of object.
Finally, we can copy each input element to its appropriate place in the output list. Since we have no way to group or order the elements yet, we must again resort to list traversal. Once again, this will be inefficient use of the GPU if the number of unique object types is small, and will also have lousy memory access patterns.
Here's a fully worked example, using your sample data set of characters. To help clarify the idea that we intend to group objects that have no inherent ordering, I have created a somewhat arbitrary object definition (my_obj), that has the == comparison operator defined, but no definition for < or >.
$ cat t707.cu
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/for_each.h>
#include <thrust/transform.h>
#include <thrust/transform_scan.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/copy.h>
#include <thrust/count.h>
#include <iostream>

template <typename T>
class my_obj
{
    T element;
    int index;
public:
    __host__ __device__ my_obj() : element(0), index(0) {};
    __host__ __device__ my_obj(T a) : element(a), index(0) {};
    __host__ __device__ my_obj(T a, int idx) : element(a), index(idx) {};

    __host__ __device__
    T get() {
        return element;}

    __host__ __device__
    void set(T a) {
        element = a;}

    __host__ __device__
    int get_idx() {
        return index;}

    __host__ __device__
    void set_idx(int idx) {
        index = idx;}

    __host__ __device__
    bool operator ==(my_obj &e2)
    {
        return (e2.get() == this->get());
    }
};

template <typename T>
struct search_functor
{
    my_obj<T> *data;
    int end;
    int *rn;
    int *ln;
    search_functor(my_obj<T> *_a, int *_rn, int *_ln, int len) : data(_a), rn(_rn), ln(_ln), end(len) {};

    __host__ __device__
    void operator()(int idx){
        // scan right from idx for the first matching element, linking the
        // pair into the right-neighbor and left-neighbor lists
        for (int i = idx+1; i < end; i++)
            if (data[idx] == data[i]) {
                ln[i] = idx;
                rn[idx] = i;
                return;}
        return;
    }
};

template <typename T>
struct copy_functor
{
    my_obj<T> *data;
    my_obj<T> *result;
    int *rn;
    copy_functor(my_obj<T> *_in, my_obj<T> *_out, int *_rn) : data(_in), result(_out), rn(_rn) {};

    __host__ __device__
    void operator()(const thrust::tuple<int, int> &t1) const {
        int idx1 = thrust::get<0>(t1);  // starting index in the output
        int idx2 = thrust::get<1>(t1);  // head of this object's list in the input
        result[idx1] = data[idx2];
        int i = rn[idx2];
        int j = 1;
        // walk the right-neighbor list, copying each member to the output
        while (i != -1){
            result[idx1+(j++)] = data[i];
            i = rn[i];}
        return;
    }
};

struct count_functor
{
    int *rn;
    int *ot;
    count_functor(int *_rn, int *_ot) : rn(_rn), ot(_ot) {};

    __host__ __device__
    int operator()(int idx1, int idx2){
        // walk the list starting at idx1, counting its length and
        // labelling each member with the object type idx2
        ot[idx1] = idx2;
        int i = rn[idx1];
        int count = 1;
        while (i != -1) {
            ot[i] = idx2;
            count++;
            i = rn[i];}
        return count;
    }
};

using namespace thrust::placeholders;

int main(){
    // data setup
    char data[] = { 'a' , 'e' , 't' , 'b' , 'c' , 'a' , 'c' , 'e' , 't' , 'a' };
    int sz = sizeof(data)/sizeof(char);
    for (int i = 0; i < sz; i++) std::cout << data[i] << ",";
    std::cout << std::endl;

    thrust::host_vector<my_obj<char> > h_data(sz);
    for (int i = 0; i < sz; i++) { h_data[i].set(data[i]); h_data[i].set_idx(i); }
    thrust::device_vector<my_obj<char> > d_data = h_data;

    // create left and right neighbor indices
    thrust::device_vector<int> ln(d_data.size(), -1);
    thrust::device_vector<int> rn(d_data.size(), -1);
    thrust::for_each(thrust::counting_iterator<int>(0), thrust::counting_iterator<int>(0) + sz, search_functor<char>(thrust::raw_pointer_cast(d_data.data()), thrust::raw_pointer_cast(rn.data()), thrust::raw_pointer_cast(ln.data()), d_data.size()));

    // determine number of unique objects
    int uni_objs = thrust::count(ln.begin(), ln.end(), -1);

    // determine the number of instances of each unique object
    // get object starting indices
    thrust::device_vector<int> uni_obj_idxs(uni_objs);
    thrust::copy_if(thrust::counting_iterator<int>(0), thrust::counting_iterator<int>(0)+d_data.size(), ln.begin(), uni_obj_idxs.begin(), (_1 == -1));

    // count each object list
    thrust::device_vector<int> num_objs(uni_objs);
    thrust::device_vector<int> obj_type(d_data.size());
    thrust::transform(uni_obj_idxs.begin(), uni_obj_idxs.end(), thrust::counting_iterator<int>(0), num_objs.begin(), count_functor(thrust::raw_pointer_cast(rn.data()), thrust::raw_pointer_cast(obj_type.data())));

    // at this point, we have built object lists that have allowed us to identify a unique, orderable "type" for each object
    // the sensible thing to do would be to employ a sort_by_key on obj_type and an index sequence at this point
    // and use the reordered index sequence to reorder the original objects, thus grouping them
    // however... without sorting...

    // build output vector indices
    thrust::device_vector<int> copy_start(num_objs.size());
    thrust::exclusive_scan(num_objs.begin(), num_objs.end(), copy_start.begin());

    // copy (by object type) input to output
    thrust::device_vector<my_obj<char> > d_result(d_data.size());
    thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(copy_start.begin(), uni_obj_idxs.begin())), thrust::make_zip_iterator(thrust::make_tuple(copy_start.end(), uni_obj_idxs.end())), copy_functor<char>(thrust::raw_pointer_cast(d_data.data()), thrust::raw_pointer_cast(d_result.data()), thrust::raw_pointer_cast(rn.data())));

    // display results
    std::cout << "Grouped: " << std::endl;
    for (int i = 0; i < d_data.size(); i++){
        my_obj<char> temp = d_result[i];
        std::cout << temp.get() << ",";}
    std::cout << std::endl;
    for (int i = 0; i < d_data.size(); i++){
        my_obj<char> temp = d_result[i];
        std::cout << temp.get_idx() << ",";}
    std::cout << std::endl;
    return 0;
}
$ nvcc -o t707 t707.cu
$ ./t707
a,e,t,b,c,a,c,e,t,a,
Grouped:
a,a,a,e,e,t,t,b,c,c,
0,5,9,1,7,2,8,3,4,6,
$
In the discussion we found out that your real goal is to eliminate duplicates in a vector of float4 elements.
In order to apply thrust::unique the elements need to be sorted.
So you need a sort method for 4-dimensional data. This can be done using space-filling curves. I have previously used the z-order curve (aka Morton code) to sort 3D data. There are efficient CUDA implementations available for the 3D case; however, quick googling did not return a ready-to-use implementation for the 4D case.
I found a paper which lists a generic algorithm for sorting n-dimensional data points using the z-order curve:
Fast construction of k-Nearest Neighbor Graphs for Point Clouds
(see Algorithm 1 : Floating Point Morton Order Algorithm).
There is also a C++ implementation available for this algorithm.
For 4D data, the loop could be unrolled, but there might be simpler and more efficient algorithms available.
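For integer coordinates, the core of the comparison can be written with Chan's XOR trick (a sketch only; the floating point version in the paper additionally compares exponents, which this does not handle):
// true if the most significant set bit of x is below that of y
inline __host__ __device__ bool less_msb(unsigned int x, unsigned int y)
{
    return x < y && x < (x ^ y);
}

// z-order comparison for unsigned integer coordinates: order by the
// coordinate whose bits differ at the highest position
struct z_order_4d_uint
{
    __host__ __device__
    bool operator()(const uint4& p, const uint4& q) const
    {
        unsigned int pc[4] = {p.x, p.y, p.z, p.w};
        unsigned int qc[4] = {q.x, q.y, q.z, q.w};
        int dim = 0;
        unsigned int hi = 0;
        for (int d = 0; d < 4; d++) {
            unsigned int diff = pc[d] ^ qc[d];
            if (less_msb(hi, diff)) { hi = diff; dim = d; }
        }
        return pc[dim] < qc[dim];
    }
};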
So the (not fully implemented) sequence of operations would then look like this:
#include <thrust/device_vector.h>
#include <thrust/unique.h>
#include <thrust/sort.h>

inline __host__ __device__ float dot(const float4& a, const float4& b)
{
    return a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
}

struct identity_4d
{
    __host__ __device__
    bool operator()(const float4& a, const float4& b) const
    {
        // based on the norm function you provided in the discussion
        return dot(a,b) < (0.1f*0.1f);
    }
};

struct z_order_4d
{
    __host__ __device__
    bool operator()(const float4& p, const float4& q) const
    {
        // you need to implement the z-order algorithm here
        // ...
    }
};

int main()
{
    const int N = 100;
    thrust::device_vector<float4> data(N);

    // fill the data
    // ...

    thrust::sort(data.begin(), data.end(), z_order_4d());
    thrust::unique(data.begin(), data.end(), identity_4d());
}

Thrust vector transformation involving neighbor elements

I have a vector, and I would like to do the following, using CUDA and Thrust transformations:
// thrust::device_vector v;
// for k times:
// calculate constants a and b as functions of k;
// for (i=0; i < v.size(); i++)
// v[i] = a*v[i] + b*v[i+1];
How should I correctly implement this? One way I can do it is to have a vector w, apply thrust::transform onto v, and save the results into w. But k is unknown ahead of time, and I don't want to create w1, w2, ... and waste a lot of GPU memory. Preferably I want to minimize the amount of data copying. But I'm not sure how to implement this using one vector without the values stepping on each other. Is there something Thrust provides that can do this?
If v.size() is large enough to fully utilize the GPU, you could launch k kernels to do this, with one extra buffer and no extra data transfers.
// assuming float elements; _1 and _2 are thrust placeholders
// (using namespace thrust::placeholders;)
thrust::device_vector<float> u(v.size());
for(int k = 0; ; )
{
    // calculate a & b
    thrust::transform(v.begin(), v.end()-1, v.begin()+1, u.begin(), a*_1 + b*_2);
    k++;
    if(k >= K)
        break;

    // calculate a & b
    thrust::transform(u.begin(), u.end()-1, u.begin()+1, v.begin(), a*_1 + b*_2);
    k++;
    if(k >= K)
        break;
}
I don't actually understand the "k times", but the following code may help you.
struct OP {
    const int a, b;
    OP(const int p, const int q): a(p), b(q){};

    __host__ __device__
    int operator()(const int v1, const int v2) const {
        return a*v1 + b*v2;
    }
};

thrust::device_vector<int> w(v.size());
thrust::transform(v.begin(), v.end()-1,  // input_1
                  v.begin()+1,           // input_2
                  w.begin(),             // output
                  OP(a, b));             // functor
v = w;
I think learning about functors and working through several Thrust examples will give you a good guide.
Hope this helps you solve your problem. :)

CUDA: Max of array, how to prevent write collisions?

I have an array of doubles stored in GPU global memory and I need to find the maximum value in it. I have read some texts about parallel reduction, so I know that one should divide the array between blocks and make each block find its own maximum, and so on.
But they never seem to address the issue of threads trying to write to the same memory position simultaneously.
Let's say that local_max=0.0 at the beginning of a block's execution. Then each thread reads its value from the input vector, decides that it is larger than local_max, and then tries to write its value to local_max. When all of this happens at exactly the same time (at least within the same warp), how can this work and end up with the actual maximum within this block?
I would think either an atomic function or some kind of lock or critical section would be needed, but I haven't seen this addressed in the answers I have found. (e.g. http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf )
The answers to your questions are contained in the very document you linked to, and the SDK reduction example shows concrete implementations of the reduction concept.
For completeness, here is a concrete example of a reduction kernel:
template <typename T, int BLOCKSIZE>
__global__ void reduction(T *inputvals, T *outputvals, int N)
{
    __shared__ volatile T data[BLOCKSIZE];

    // Here maxfunc(a,b) sets a to the maximum of a and b
    T maxval = inputvals[threadIdx.x];
    for(int i=blockDim.x + threadIdx.x; i<N; i+=blockDim.x)
    {
        maxfunc(maxval, inputvals[i]);
    }
    data[threadIdx.x] = maxval;
    __syncthreads();

    // warp-synchronous reduction in shared memory
    if (threadIdx.x < 32) {
        for(int i=32+threadIdx.x; i < BLOCKSIZE; i+= 32) {
            maxfunc(data[threadIdx.x], data[i]);
        }
        if (threadIdx.x < 16) maxfunc(data[threadIdx.x], data[threadIdx.x+16]);
        if (threadIdx.x < 8)  maxfunc(data[threadIdx.x], data[threadIdx.x+8]);
        if (threadIdx.x < 4)  maxfunc(data[threadIdx.x], data[threadIdx.x+4]);
        if (threadIdx.x < 2)  maxfunc(data[threadIdx.x], data[threadIdx.x+2]);
        if (threadIdx.x == 0) {
            maxfunc(data[0], data[1]);
            outputvals[blockIdx.x] = data[0];
        }
    }
}
The key point is using the synchronization that is implicit within a warp to perform the reduction in shared memory. The result is a single per-block maximum value. A second reduction pass is required to reduce the set of block maximums to the global maximum (often it is faster to do this on the host). In this example, maxfunc is the "compare and set" function, which could be as simple as
template <typename T>
__device__ void maxfunc(T & a, T & b)
{
    a = (b > a) ? b : a;
}
Don't cook your own code, use some Thrust (included in version 4.0 of the CUDA SDK):
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/copy.h>
#include <thrust/extrema.h>
#include <iostream>
#include <iterator>

int main(void)
{
    thrust::host_vector<int> h_vec(10000);
    thrust::sequence(h_vec.begin(), h_vec.end());

    // show h_vec
    thrust::copy(h_vec.begin(), h_vec.end(),
                 std::ostream_iterator<int>(std::cout, "\n"));

    // transfer to device
    thrust::device_vector<int> d_vec = h_vec;
    int max_dvec_value = *thrust::max_element(d_vec.begin(), d_vec.end());
    std::cout << "max value: " << max_dvec_value << "\n";
    return 0;
}
And watch out: thrust::max_element returns an iterator, which is why it is dereferenced above.
Your question is clearly answered in the document you link to. I think you just need to spend some more time reading it and understanding the CUDA concepts used in it. In particular, I would focus on shared memory, the __syncthreads() method, and how to uniquely identify a thread while inside a kernel. Additionally, you should try to understand why the reduction may need to be run in 2 passes to find the global maximum.