Can Thrust transform_reduce work with 2 arrays? - cuda

I found that what Thrust can provide is quite limited, as the code below shows.
I end up with 9*9*2 (1 multiply + 1 reduce) Thrust calls, which is 162 kernel launches,
whereas if I write my own kernel, only 1 kernel launch is needed.
for(i=1;i<=9;i++)
{
for(j=i;j<=9;j++)
{
ATA[i][j]=0;
for(m=1;m<=50000;m++)
ATA[i][j]=ATA[i][j]+X[idx0[i]][m]*X[idx0[j]][m];
}
}
I then ended up with the Thrust implementation below:
for(i=1;i<=dim0;i++)
{
for(j=i;j<=dim0;j++)
{
thrust::transform(t_d_X+(idx0[i]-1)*(1+iNumPaths)+1, t_d_X+(idx0[i]-1)*(1+iNumPaths)+iNumPaths+1, t_d_X+(idx0[j]-1)*(1+iNumPaths)+1,t_d_cdataMulti, thrust::multiplies<double>());
ATA[i][j] = thrust::reduce(t_d_cdataMulti, t_d_cdataMulti+iNumPaths, (double) 0, thrust::plus<double>());
}
}
Some analysis:
1. transform_reduce: will NOT help, because there is a pointer redirection through idx0[i], and two arrays are involved: the first is X[idx0[i]], the second is X[idx0[j]].
2. reduce_by_key: will help, but I need to store all intermediate results in one big array and prepare an equally large key mapping table. I will try it out.
3. transform_iterator: will NOT help, for the same reason as 1.
So I can't avoid writing my own kernel?

I'll bet @m.s. can provide a more efficient approach, but here is one possible approach. In order to get the entire computation reduced to a single kernel call by thrust, it is necessary to handle everything with a single thrust algorithm call. At the heart of the operation, we are summing many computations together, to fill a matrix. Therefore I believe thrust::reduce_by_key is an appropriate thrust algorithm to use. This means we must realize all other transformations using various thrust "fancy iterators", which are mostly covered in the thrust getting started guide.
Attempting to do this (handle everything with a single kernel call) makes the code very dense and hard to read. I don't normally like to demonstrate thrust this way, but since it is the crux of your question, it cannot be avoided. Therefore let's unpack the sequence of operations contained in the call to reduce_by_key, approximately from the inside out. The general basis of this algorithm is to "flatten" all data into a single long logical vector. Let's assume for understanding that our square matrix dimensions are only 2x2 and the length of our m vector is 3. You can think of the "flattening" or linear-index conversion like this:
linear index:  0  1  2  3  4  5  6  7  8  9 10 11
i index:       0  0  0  0  0  0  1  1  1  1  1  1
j index:       0  0  0  1  1  1  0  0  0  1  1  1
m index:       0  1  2  0  1  2  0  1  2  0  1  2
k index:       0  0  0  1  1  1  2  2  2  3  3  3
The "k index" above is our keys that will ultimately be used by reduce_by_key to collect product terms together, for each element of the matrix. Note that the code has EXT_I, EXT_J, EXT_M, and EXT_K helper macros which will define, using thrust placeholders, the operation to be performed on the linear index (created using a counting_iterator) to produce the various other "indices".
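As a rough illustration of the index arithmetic in the table above, here is a minimal host-side C++ sketch. It is not part of the original answer: the FlatIndex struct and the decompose function are hypothetical names, and the divisions and remainders simply mirror what the EXT_I, EXT_J, EXT_M and EXT_K macros compute with placeholders.

```cpp
// Hypothetical helper (not from the original answer): the four indices
// derived from one flattened linear index.
struct FlatIndex {
    int i;  // matrix row
    int j;  // matrix column
    int m;  // position along the summed dimension
    int k;  // key: one distinct value per (i, j) matrix element
};

// d2 = number of matrix columns, d3 = length of the m dimension.
// The arithmetic mirrors the EXT_I, EXT_J, EXT_M and EXT_K macros.
inline FlatIndex decompose(int linear, int d2, int d3) {
    FlatIndex r;
    r.i = linear / (d2 * d3);   // EXT_I
    r.j = (linear / d3) % d2;   // EXT_J
    r.m = linear % d3;          // EXT_M
    r.k = linear / d3;          // EXT_K
    return r;
}
```

For the 2x2 matrix with an m length of 3, decompose(7, 2, 3) yields i = 1, j = 0, m = 1, k = 2, which matches column 7 of the table.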
1. The first thing we will need to do is construct a suitable thrust operation to convert the linear index into the transformed value of idx0[i] (again, working from "inward to outward"). We can do this with a permutation iterator on the idx0 vector, with a transform_iterator supplying the "map" for the permutation iterator - this transform iterator just converts the linear index (mb) to an "i" index:
thrust::make_permutation_iterator(d_idx0.begin(), thrust::make_transform_iterator(mb, EXT_I))
2. Now we need to combine the result from step 1 with the other index - m in this case - to generate a linearized version of the 2D index into X (d_X is the vector-linearized version of X). To do this, we will combine the result of step 1 in a zip_iterator with another transform iterator that creates the m index. This zip_iterator will be passed to a transform_iterator which takes the two indices and converts them into a linearized index to "look into" the d_X vector:
thrust::make_transform_iterator(thrust::make_zip_iterator(thrust::make_tuple(thrust::make_permutation_iterator(d_idx0.begin(), thrust::make_transform_iterator(mb, EXT_I)), thrust::make_transform_iterator(mb, EXT_M))), create_Xidx())
create_Xidx is the functor that takes the two computed indices and converts them into the linear index into d_X.
3. With the result from step 2, we can then use a permutation iterator to grab the appropriate value from d_X for the first term in the multiplication:
thrust::make_permutation_iterator(d_X.begin(), {code from step 2})
4. Repeat steps 1, 2 and 3, using EXT_J instead of EXT_I, to create the second term in the multiplication:
X[idx0[i]][m]*X[idx0[j]][m]
5. Place the terms created in steps 3 and 4 into a zip_iterator, for use by the transform_iterator that will multiply the two together (using the my_mult functor) to create the actual product:
thrust::make_transform_iterator(thrust::make_zip_iterator(thrust::make_tuple({result from step 3}, {result from step 4})), my_mult())
The remainder of the reduce_by_key is fairly straightforward. We create the keys index as described previously, and then use it to sum together the various products for each element of the square matrix.
Here is a fully worked example:
$ cat t875.cu
#include <iostream>
#include <thrust/reduce.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/permutation_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
// rows
#define D1 9
// cols
#define D2 9
// size of m
#define D3 50
// helpers to convert linear indices to i,j,m or "key" indices
#define EXT_I (_1/(D2*D3))
#define EXT_J ((_1/(D3))%D2)
#define EXT_M (_1%D3)
#define EXT_K (_1/D3)
void test_cpu(float ATA[][D2], float X[][D3], int idx0[]){
for(int i=0;i<D1;i++)
{
for(int j=0;j<D2;j++)
{
ATA[i][j]=0;
for(int m=0;m<D3;m++)
ATA[i][j]=ATA[i][j]+X[idx0[i]][m]*X[idx0[j]][m];
}
}
}
using namespace thrust::placeholders;
// converts a (row, m) index pair into a linear index into d_X
struct create_Xidx : public thrust::unary_function<thrust::tuple<int, int>, int>{
__host__ __device__
int operator()(const thrust::tuple<int, int> &my_tuple) const {
return (thrust::get<0>(my_tuple) * D3) + thrust::get<1>(my_tuple);
}
};
// multiplies the two elements of a tuple
struct my_mult : public thrust::unary_function<thrust::tuple<float, float>, float>{
__host__ __device__
float operator()(const thrust::tuple<float, float> &my_tuple) const {
return thrust::get<0>(my_tuple) * thrust::get<1>(my_tuple);
}
};
int main(){
//synthesize data
float ATA[D1][D2];
float X[D1][D3];
int idx0[D1];
thrust::host_vector<float> h_X(D1*D3);
thrust::host_vector<int> h_idx0(D1);
for (int i = 0; i < D1; i++){
idx0[i] = (i + 2)%D1; h_idx0[i] = idx0[i];
for (int j = 0; j < D2; j++) {ATA[i][j] = 0;}
for (int j = 0; j < D3; j++) {X[i][j] = j%(i+1); h_X[i*D3+j] = X[i][j];}}
thrust::device_vector<float> d_ATA(D1*D2);
thrust::device_vector<float> d_X = h_X;
thrust::device_vector<int> d_idx0 = h_idx0;
// helpers
thrust::counting_iterator<int> mb = thrust::make_counting_iterator(0);
thrust::counting_iterator<int> me = thrust::make_counting_iterator(D1*D2*D3);
// perform computation
thrust::reduce_by_key(thrust::make_transform_iterator(mb, EXT_K), thrust::make_transform_iterator(me, EXT_K), thrust::make_transform_iterator(thrust::make_zip_iterator(thrust::make_tuple(thrust::make_permutation_iterator(d_X.begin(), thrust::make_transform_iterator(thrust::make_zip_iterator(thrust::make_tuple(thrust::make_permutation_iterator(d_idx0.begin(), thrust::make_transform_iterator(mb, EXT_I)), thrust::make_transform_iterator(mb, EXT_M))), create_Xidx())), thrust::make_permutation_iterator(d_X.begin(), thrust::make_transform_iterator(thrust::make_zip_iterator(thrust::make_tuple(thrust::make_permutation_iterator(d_idx0.begin(), thrust::make_transform_iterator(mb, EXT_J)), thrust::make_transform_iterator(mb, EXT_M))), create_Xidx())))), my_mult()), thrust::make_discard_iterator(), d_ATA.begin());
thrust::host_vector<float> h_ATA = d_ATA;
test_cpu(ATA, X, idx0);
std::cout << "GPU: CPU: " << std::endl;
for (int i = 0; i < D1*D2; i++)
std::cout << i/D2 << "," << i%D2 << ":" << h_ATA[i] << " " << ATA[i/D2][i%D2] << std::endl;
}
$ nvcc -o t875 t875.cu
$ ./t875
GPU: CPU:
0,0:81 81
0,1:73 73
0,2:99 99
0,3:153 153
0,4:145 145
0,5:169 169
0,6:219 219
0,7:0 0
0,8:25 25
1,0:73 73
1,1:169 169
1,2:146 146
1,3:193 193
1,4:212 212
1,5:313 313
1,6:280 280
1,7:0 0
1,8:49 49
2,0:99 99
2,1:146 146
2,2:300 300
2,3:234 234
2,4:289 289
2,5:334 334
2,6:390 390
2,7:0 0
2,8:50 50
3,0:153 153
3,1:193 193
3,2:234 234
3,3:441 441
3,4:370 370
3,5:433 433
3,6:480 480
3,7:0 0
3,8:73 73
4,0:145 145
4,1:212 212
4,2:289 289
4,3:370 370
4,4:637 637
4,5:476 476
4,6:547 547
4,7:0 0
4,8:72 72
5,0:169 169
5,1:313 313
5,2:334 334
5,3:433 433
5,4:476 476
5,5:841 841
5,6:604 604
5,7:0 0
5,8:97 97
6,0:219 219
6,1:280 280
6,2:390 390
6,3:480 480
6,4:547 547
6,5:604 604
6,6:1050 1050
6,7:0 0
6,8:94 94
7,0:0 0
7,1:0 0
7,2:0 0
7,3:0 0
7,4:0 0
7,5:0 0
7,6:0 0
7,7:0 0
7,8:0 0
8,0:25 25
8,1:49 49
8,2:50 50
8,3:73 73
8,4:72 72
8,5:97 97
8,6:94 94
8,7:0 0
8,8:25 25
$
Notes:
If you profile the above code with e.g. nvprof --print-gpu-trace ./t875, you will witness two kernel calls. The first is associated with the device_vector creation. The second kernel call handles the entire reduce_by_key operation.
I don't know if all this is slower or faster than your CUDA kernel, since you haven't provided it. Sometimes, expertly written CUDA kernels can be faster than thrust algorithms doing the same operation.
It's quite possible that what I have here is not precisely the algorithm you had in mind. For example, your code suggests you're only filling in a triangular portion of ATA. But your description (9*9*2) suggests you want to populate every position in ATA. Nevertheless, my intent is not to give you a black box but to demonstrate how you can use various thrust approaches to achieve whatever it is you want in a single kernel call.

Related

Use thrust to find element in groups

I have two int vectors for keys and values, their size is about 500K.
The key vector is already sorted. And there are 10K groups approximately.
Each value is either non-negative (useful) or -2 (not useful). Each group should contain at most one non-negative value; the rest are -2.
key:    0  0  0  0  1  2  2  3  3  3  3
value: -2 -2  1 -2  3 -2 -2 -2 -2 -2  0
In group 0, the third pair [0 1] is useful. For group 1 we get the pair [1 3]. The values of group 2 are all -2, so we get nothing. For group 3, the result is [3 0].
So the question is: how can I do this with Thrust or CUDA?
Here are two ideas.
First one:
Get the number of each group by a histogram algorithm. So the barrier of each group can be computed.
Operate thrust::find_if on each group to get the useful element.
Second one:
Use thrust::transform to add 2 to every value, so that all values are non-negative and zero stands for "not useful".
Use thrust::reduce_by_key to get the reduction for every group, and then subtract 2 from every output value.
I think there must be other methods that achieve much better performance than the two above.
Performance of the methods:
I have tested the second method above and the method given by @Robert Crovella, i.e. the reduce_by_key method and the remove_if method.
The size of the vectors is 2691028, and the vectors consist of 100001 groups. Here are their average times:
reduce_by_key: 1204ms
remove_if: 192ms
From the results above, we can see that the remove_if method is much faster. The remove_if method is also easy to implement and consumes much less GPU memory.
In short, @Robert Crovella's method is very good.
I would use a thrust::zip_iterator to zip the key and values pairs together, and then I would do a thrust::remove_if operation on the zipped values which would require a functor definition that would indicate to remove every pair for which the value is negative (or whatever test you wish.)
Here's a worked example:
$ cat t1009.cu
#include <iostream>
#include <iterator>
#include <thrust/remove.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/copy.h>
struct remove_func
{
template <typename T>
__host__ __device__
bool operator()(T &t){
return (thrust::get<1>(t) < 0); // could change to other kinds of tests
}
};
int main(){
int keys[] = {0,0,0,0,1,2,2,3,3,3,3};
int vals[] = {-2,-2,1,-2,3,-2,-2,-2,-2,-2,0};
size_t dsize = sizeof(keys)/sizeof(int);
thrust::device_vector<int>dkeys(keys, keys+dsize);
thrust::device_vector<int>dvals(vals, vals+dsize);
auto zr = thrust::make_zip_iterator(thrust::make_tuple(dkeys.begin(), dvals.begin()));
size_t rsize = thrust::remove_if(zr, zr+dsize, remove_func()) - zr;
thrust::copy_n(dkeys.begin(), rsize, std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl;
thrust::copy_n(dvals.begin(), rsize, std::ostream_iterator<int>(std::cout, ","));
std::cout << std::endl;
return 0;
}
$ nvcc -std=c++11 -o t1009 t1009.cu
$ ./t1009
0,1,3,
1,3,0,
$

Clarification on the flow of realtime work to a GPU

I just started learning CUDA, and I am confused by one point. For the sake of argument, imagine that I had several hundred buoys in the ocean. Imagine that they broadcast a std::vector intermittently once every few milliseconds. The vector might be 5 readings, or 10 readings, etc, depending on the conditions in the ocean at that time. There is no way to tell when the event will fire, it is not deterministic.
Imagine that I had the idea that I could predict the temperature from gathering all this information in realtime, but that the predictor had to first sort all std::vectors on temperature across all buoys. My question is this. Do I have to copy the entire data back to the GPU every time a single buoy fires an event? Since the other buoys' data has not changed, can I leave that data in the GPU and just update what has changed and ask the kernel to rerun the prediction?
If yes, what is the [thrust pseudo]-code that would do this? Is this best done with streams and events and pinned memory? What is the limit as to how fast I can update the GPU with realtime data?
I was told that this sort of problem is not well suited to GPU and better in FPGA.
A basic sequence could be like this.
Setup phase (initial sort):
Gather an initial set of vectors from each buoy.
Create a parallel set of vectors, one for each buoy, of length equal to the initial length of the buoy vector, and populated by the buoy index:
b1: 1.5 1.7 2.2 2.3 2.6
i1: 1 1 1 1 1
b2: 2.4 2.5 2.6
i2: 2 2 2
b3: 2.8
i3: 3
Concatenate all vectors into a single buoy-temp-vector and buoy-index-vector:
b: 1.5 1.7 2.2 2.3 2.6 2.4 2.5 2.6 2.8
i: 1 1 1 1 1 2 2 2 3
Sort-by-key:
b: 1.5 1.7 2.2 2.3 2.4 2.5 2.6 2.6 2.8
i: 1 1 1 1 2 2 1 2 3
The setup phase is complete. The update phase is executed whenever a buoy update is received. Suppose buoy 2 sends an update:
b2: 2.5 2.7 2.9 3.0
Do thrust::remove_if on the buoy vector, if the corresponding index vector position holds the updated buoy number (2 in this case). Repeat the remove_if on the index vector using the same rule:
b: 1.5 1.7 2.2 2.3 2.6 2.8
i: 1 1 1 1 1 3
Generate the corresponding index vector for the buoy to be updated, and copy both vectors (buoy 2 temp-value and index vectors) to the device:
b2: 2.5 2.7 2.9 3.0
i2: 2 2 2 2
Do thrust::merge_by_key on the newly received update from buoy 2
b: 1.5 1.7 2.2 2.3 2.5 2.6 2.7 2.8 2.9 3.0
i: 1 1 1 1 2 1 2 3 2 2
The only data that has to be copied to the device on an update cycle is the actual buoy data to be updated. Note that with some work, the setup phase could be eliminated, and the initial assembly of the vectors could be merely seen as "updates" from each buoy, into initially-empty buoy value and buoy index vectors. But for description, it's easier to visualize with a setup phase, I think. The above description doesn't explicitly point out the various vector sizings and resizings needed, but this can be accomplished using the same methods one would use on std::vector. Vector resizing may be "costly" on the GPU, just as it can be "costly" on the CPU (if a resize to larger triggers a new allocation and copy...) but this could also be eliminated if a max number of buoys is known and a max number of elements per update is known. In that case, we could allocate our overall buoy value and buoy index vectors to be the maximum necessary sizes.
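Before the GPU version, the update cycle can be modeled on the host. The following is only a sketch in plain C++, not Thrust code: the Reading alias and the apply_update function are hypothetical names, std::remove_if stands in for thrust::remove_if, and std::merge with a comparator on the temperature stands in for thrust::merge_by_key. It assumes the incoming update is already sorted by temperature.

```cpp
#include <algorithm>
#include <iterator>
#include <utility>
#include <vector>

// (temperature, buoy index) pairs, kept sorted by temperature.
using Reading = std::pair<float, int>;

// Hypothetical host-side model of one update cycle: drop the old readings of
// `buoy`, then merge in its new readings (assumed sorted in ascending order).
std::vector<Reading> apply_update(std::vector<Reading> all, int buoy,
                                  const std::vector<float>& update) {
    // Remove the stale readings; the relative order of the rest is preserved.
    all.erase(std::remove_if(all.begin(), all.end(),
                             [buoy](const Reading& r) { return r.second == buoy; }),
              all.end());

    // Tag every new reading with the buoy index.
    std::vector<Reading> tagged;
    for (float t : update) tagged.push_back({t, buoy});

    // Merge the two sorted ranges, comparing on temperature only.
    std::vector<Reading> merged;
    std::merge(all.begin(), all.end(), tagged.begin(), tagged.end(),
               std::back_inserter(merged),
               [](const Reading& a, const Reading& b) { return a.first < b.first; });
    return merged;
}
```

Applied to the sorted example above with the update from buoy 2, this produces the same temperature and index sequences as the merge_by_key step.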
Here is a fully-worked example following the above outline. As a placeholder, I have included a dummy prediction_kernel call, showing where you could insert your specialized prediction code, operating on the sorted data.
#include <stdio.h>
#include <stdlib.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/remove.h>
#include <thrust/sort.h>
#include <thrust/merge.h>
#include <sys/time.h>
#include <time.h>
#define N_BUOYS 1024
#define N_MAX_UPDATE 1024
#define T_RANGE 100
#define N_UPDATES_TEST 1000
struct equal_func{
const int idx;
equal_func(int _idx) : idx(_idx) {}
__host__ __device__
bool operator()(int test_val) {
return (test_val == idx);
}
};
__device__ float dev_result[N_UPDATES_TEST];
// dummy "prediction" kernel
__global__ void prediction_kernel(const float *data, int iter, size_t d_size){
int idx=threadIdx.x+blockDim.x*blockIdx.x;
if (idx == 0) dev_result[iter] = data[d_size/2];
}
void create_vec(unsigned int id, thrust::host_vector<float> &data, thrust::host_vector<int> &idx){
size_t mysize = rand()%N_MAX_UPDATE;
data.resize(mysize);
idx.resize(mysize);
for (int i = 0; i < mysize; i++){
data[i] = ((float)rand()/(float)RAND_MAX)*(float)T_RANGE;
idx[i] = id;}
thrust::sort(data.begin(), data.end());
}
int main(){
timeval t1, t2;
int pp = 0;
// ping-pong processing vectors
thrust::device_vector<float> buoy_data[2];
buoy_data[0].resize(N_BUOYS*N_MAX_UPDATE);
buoy_data[1].resize(N_BUOYS*N_MAX_UPDATE);
thrust::device_vector<int> buoy_idx[2];
buoy_idx[0].resize(N_BUOYS*N_MAX_UPDATE);
buoy_idx[1].resize(N_BUOYS*N_MAX_UPDATE);
// vectors for initial buoy data
thrust::host_vector<float> h_buoy_data[N_BUOYS];
thrust::host_vector<int> h_buoy_idx[N_BUOYS];
//SETUP
// populate initial data
int lidx=0;
for (int i = 0; i < N_BUOYS; i++){
create_vec(i, h_buoy_data[i], h_buoy_idx[i]);
thrust::copy(h_buoy_data[i].begin(), h_buoy_data[i].end(), &(buoy_data[pp][lidx]));
thrust::copy(h_buoy_idx[i].begin(), h_buoy_idx[i].end(), &(buoy_idx[pp][lidx]));
lidx+= h_buoy_data[i].size();}
// sort initial data
thrust::sort_by_key(&(buoy_data[pp][0]), &(buoy_data[pp][lidx]), &(buoy_idx[pp][0]));
//UPDATE CYCLE
gettimeofday(&t1, NULL);
for (int i = 0; i < N_UPDATES_TEST; i++){
unsigned int vec_to_update = rand()%N_BUOYS;
int nidx = lidx - h_buoy_data[vec_to_update].size();
create_vec(vec_to_update, h_buoy_data[vec_to_update], h_buoy_idx[vec_to_update]);
thrust::remove_if(&(buoy_data[pp][0]), &(buoy_data[pp][lidx]), buoy_idx[pp].begin(), equal_func(vec_to_update));
thrust::remove_if(&(buoy_idx[pp][0]), &(buoy_idx[pp][lidx]), equal_func(vec_to_update));
lidx = nidx + h_buoy_data[vec_to_update].size();
thrust::device_vector<float> temp_data = h_buoy_data[vec_to_update];
thrust::device_vector<int> temp_idx = h_buoy_idx[vec_to_update];
int ppn = (pp == 0)?1:0;
thrust::merge_by_key(&(buoy_data[pp][0]), &(buoy_data[pp][nidx]), temp_data.begin(), temp_data.end(), buoy_idx[pp].begin(), temp_idx.begin(), buoy_data[ppn].begin(), buoy_idx[ppn].begin() );
pp = ppn; // update ping-pong buffer index
prediction_kernel<<<1,1>>>(thrust::raw_pointer_cast(buoy_data[pp].data()), i, lidx);
}
gettimeofday(&t2, NULL);
unsigned int tdiff_us = ((t2.tv_sec*1000000)+t2.tv_usec) - ((t1.tv_sec*1000000)+t1.tv_usec);
printf("Completed %d updates in %f sec\n", N_UPDATES_TEST, (float)tdiff_us/(float)1000000);
// float *temps = (float *)malloc(N_UPDATES_TEST*sizeof(float));
// cudaMemcpyFromSymbol(temps, dev_result, N_UPDATES_TEST*sizeof(float));
// for (int i = 0; i < 100; i++) printf("temp %d: %f\n", i, temps[i]);
return 0;
}
Using CUDA 6 on Linux with a Quadro 5000 GPU, 1000 "updates" require about 2 seconds. The majority of the time is spent in the calls to thrust::remove_if and thrust::merge_by_key. For worst-case real-time estimation, I suppose you would want to time the worst-case update, which might be something like receiving the longest possible update.

Thrust vector transformation involving removing vector elements

I have a thrust device_vector divided into chunks of 100 (but altogether contiguous in GPU memory), and I want to remove the last 5 elements of each chunk, without having to allocate a new device_vector to copy it into.
// Layout in memory before (number of elements in each contiguous subblock listed):
// [ 95 | 5 ][ 95 | 5 ][ 95 | 5 ]........
// Layout in memory after cutting out the last 5 of each chunk (number of elements listed)
// [ 95 ][ 95 ][ 95 ].........
thrust::device_vector v;
// call some function on v;
// so elements 95-99, 195-199, 295-299, etc. are removed (assuming 0-based indexing)
How can I correctly implement this? Preferably I would like to avoid allocating a new vector in GPU memory to save the transform into. I understand there are Thrust template functions for dealing with these kinds of operations, but I have trouble stringing them together. Is there something Thrust provides that can do this?
Without allocating a buffer, you have to preserve the copying order, which cannot be parallelized to fully utilize the GPU hardware.
Here's a version that does this using Thrust with a buffer.
It requires Thrust 1.6.0+ since the lambda expression functor (placeholders) is used on iterators.
#include "thrust/device_vector.h"
#include "thrust/iterator/counting_iterator.h"
#include "thrust/iterator/permutation_iterator.h"
#include "thrust/iterator/transform_iterator.h"
#include "thrust/copy.h"
#include "thrust/functional.h"
using namespace thrust::placeholders;
int main()
{
const int oldChunk = 100, newChunk = 95;
const int size = 10000;
thrust::device_vector<float> v(
thrust::counting_iterator<float>(0),
thrust::counting_iterator<float>(0) + oldChunk * size);
thrust::device_vector<float> buf(newChunk * size);
thrust::copy(
thrust::make_permutation_iterator(
v.begin(),
thrust::make_transform_iterator(
thrust::counting_iterator<int>(0),
_1 / newChunk * oldChunk + _1 % newChunk)),
thrust::make_permutation_iterator(
v.begin(),
thrust::make_transform_iterator(
thrust::counting_iterator<int>(0),
_1 / newChunk * oldChunk + _1 % newChunk))
+ buf.size(),
buf.begin());
return 0;
}
I think the above version may not achieve the highest performance because of the modulo operator %. For higher performance you may consider the cuBLAS function cublasSgeam():
float alpha = 1;
float beta = 0;
cublasSgeam(handle, CUBLAS_OP_N, CUBLAS_OP_N,
newChunk, size,
&alpha,
thrust::raw_pointer_cast(&v[0]), oldChunk,
&beta,
thrust::raw_pointer_cast(&v[0]), oldChunk,
thrust::raw_pointer_cast(&buf[0]), newChunk);
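The gather map that the Thrust version builds with placeholders can be checked on the host. This is a small sketch in plain C++ under that assumption, with hypothetical names (source_index, trim_chunks); it only reproduces the expression _1 / newChunk * oldChunk + _1 % newChunk and is not a replacement for the GPU code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: map an index in the trimmed layout to the index of the
// same element in the original layout.
inline std::size_t source_index(std::size_t out, std::size_t oldChunk,
                                std::size_t newChunk) {
    return out / newChunk * oldChunk + out % newChunk;
}

// Host-side reference: keep the first newChunk elements of every chunk.
// Assumes v.size() is a multiple of oldChunk and newChunk <= oldChunk.
std::vector<float> trim_chunks(const std::vector<float>& v,
                               std::size_t oldChunk, std::size_t newChunk) {
    std::vector<float> buf(v.size() / oldChunk * newChunk);
    for (std::size_t k = 0; k < buf.size(); ++k)
        buf[k] = v[source_index(k, oldChunk, newChunk)];
    return buf;
}
```

With oldChunk = 100 and newChunk = 95, output index 94 reads source element 94 and output index 95 reads source element 100, so elements 95-99 of the first chunk are skipped.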

Converting thrust::iterators to and from raw pointers

I want to use Thrust library to calculate prefix sum of device array in CUDA.
My array is allocated with cudaMalloc(). My requirement is as follows:
main()
{
Launch kernel 1 on data allocated through cudaMalloc()
// This kernel will populate some data d.
Use thrust to calculate prefix sum of d.
Launch kernel 2 on prefix sum.
}
I want to use Thrust somewhere between my kernels, so I need a method to convert pointers to device iterators and back. What is wrong with the following code?
int main()
{
int *a;
cudaMalloc((void**)&a,N*sizeof(int));
thrust::device_ptr<int> d=thrust::device_pointer_cast(a);
thrust::device_vector<int> v(N);
thrust::exclusive_scan(a,a+N,v);
return 0;
}
A complete working example from your latest edit would look like this:
#include <thrust/device_ptr.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/fill.h>
#include <thrust/copy.h>
#include <cstdio>
int main()
{
const int N = 16;
int * a;
cudaMalloc((void**)&a, N*sizeof(int));
thrust::device_ptr<int> d = thrust::device_pointer_cast(a);
thrust::fill(d, d+N, 2);
thrust::device_vector<int> v(N);
thrust::exclusive_scan(d, d+N, v.begin());
int v_[N];
thrust::copy(v.begin(), v.end(), v_);
for(int i=0; i<N; i++)
printf("%d %d\n", i, v_[i]);
return 0;
}
The things you got wrong:
N not defined anywhere
passing the raw device pointer a rather than the device_ptr d as the input iterator to exclusive_scan
passing the device_vector v to exclusive_scan rather than the appropriate iterator v.begin()
Attention to detail was all that was lacking to make this work. And work it does:
$ nvcc -arch=sm_12 -o thrust_kivekset thrust_kivekset.cu
$ ./thrust_kivekset
0 0
1 2
2 4
3 6
4 8
5 10
6 12
7 14
8 16
9 18
10 20
11 22
12 24
13 26
14 28
15 30
Edit:
thrust::device_vector.data() will return a thrust::device_ptr which points to the first element of the vector. thrust::device_ptr.get() will return a raw device pointer. Therefore
cudaMemcpy(v_, v.data().get(), N*sizeof(int), cudaMemcpyDeviceToHost);
and
thrust::copy(v.begin(), v.end(), v_);
are functionally equivalent in this example.
Convert your raw pointer obtained from cudaMalloc() to a thrust::device_ptr using thrust::device_pointer_cast. Here's an example from the Thrust docs:
#include <thrust/device_ptr.h>
#include <thrust/fill.h>
#include <cuda.h>
int main(void)
{
size_t N = 10;
// obtain raw pointer to device memory
int * raw_ptr;
cudaMalloc((void **) &raw_ptr, N * sizeof(int));
// wrap raw pointer with a device_ptr
thrust::device_ptr<int> dev_ptr = thrust::device_pointer_cast(raw_ptr);
// use device_ptr in Thrust algorithms
thrust::fill(dev_ptr, dev_ptr + N, (int) 0);
// access device memory transparently through device_ptr
dev_ptr[0] = 1;
// free memory
cudaFree(raw_ptr);
return 0;
}
Use thrust::inclusive_scan or thrust::exclusive_scan to compute the prefix sum.
http://code.google.com/p/thrust/wiki/QuickStartGuide#Prefix-Sums

Counting occurrences of numbers in a CUDA array

I have an array of unsigned integers stored on the GPU with CUDA (typically 1000000 elements). I would like to count the occurrence of every number in the array. There are only a few distinct numbers (about 10), but these numbers can span from 1 to 1000000. About 9/10th of the numbers are 0, I don't need the count of them. The result looks something like this:
58458 -> 1000 occurrences
15 -> 412 occurrences
I have an implementation using atomicAdds, but it is too slow (a lot of threads write to the same address). Does someone know of a fast/efficient method?
You can implement a histogram by first sorting the numbers, and then doing a keyed reduction.
The most straightforward method would be to use thrust::sort and then thrust::reduce_by_key. It's also often much faster than ad hoc binning based on atomics.
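To make the idea concrete, here is a minimal host-side sketch of "sort, then reduce by key" in plain C++. It is an illustration rather than the Thrust implementation: count_occurrences is a hypothetical name, std::sort plays the role of thrust::sort, and the loop over runs of equal values plays the role of thrust::reduce_by_key with a vector of ones as values. Zeros are skipped because the question does not need their count.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical host-side model: returns (value, count) pairs for every
// distinct non-zero value, in ascending order of value.
std::vector<std::pair<unsigned, unsigned>>
count_occurrences(std::vector<unsigned> data) {
    std::sort(data.begin(), data.end());      // equal values become adjacent
    std::vector<std::pair<unsigned, unsigned>> out;
    for (unsigned v : data) {
        if (v == 0) continue;                 // the count of zeros is not needed
        if (!out.empty() && out.back().first == v)
            ++out.back().second;              // extend the current run
        else
            out.push_back({v, 1u});           // start a new run
    }
    return out;
}
```

On the GPU, the sort dominates the cost, and the keyed reduction replaces the contended atomicAdd updates with one pass over contiguous runs.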
I suppose you can find help in the CUDA examples, specifically the histogram examples. They are part of the GPU computing SDK.
You can find it here http://developer.nvidia.com/cuda-cc-sdk-code-samples#histogram. They even have a whitepaper explaining the algorithms.
I'm comparing two approaches suggested at the duplicate question thrust count occurence, namely,
Using thrust::counting_iterator and thrust::upper_bound, following the histogram Thrust example;
Using thrust::unique_copy and thrust::upper_bound.
Below, please find a fully worked example.
#include <time.h> // --- time
#include <stdlib.h> // --- srand, rand
#include <stdio.h> // --- printf
#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/unique.h>
#include <thrust/binary_search.h>
#include <thrust/adjacent_difference.h>
#include "Utilities.cuh"
#include "TimingGPU.cuh"
//#define VERBOSE
#define NO_HISTOGRAM
/********/
/* MAIN */
/********/
int main() {
const int N = 1048576;
//const int N = 20;
//const int N = 128;
TimingGPU timerGPU;
// --- Initialize random seed
srand(time(NULL));
thrust::host_vector<int> h_code(N);
for (int k = 0; k < N; k++) {
// --- Generate random numbers between 0 and 9
h_code[k] = (rand() % 10);
}
thrust::device_vector<int> d_code(h_code);
//thrust::device_vector<unsigned int> d_counting(N);
thrust::sort(d_code.begin(), d_code.end());
h_code = d_code;
timerGPU.StartCounter();
#ifdef NO_HISTOGRAM
// --- The number of d_cumsum bins is equal to the maximum value plus one
int num_bins = d_code.back() + 1;
thrust::device_vector<int> d_code_unique(num_bins);
thrust::unique_copy(d_code.begin(), d_code.end(), d_code_unique.begin());
thrust::device_vector<int> d_counting(num_bins);
thrust::upper_bound(d_code.begin(), d_code.end(), d_code_unique.begin(), d_code_unique.end(), d_counting.begin());
#else
thrust::device_vector<int> d_cumsum;
// --- The number of d_cumsum bins is equal to the maximum value plus one
int num_bins = d_code.back() + 1;
// --- Resize d_cumsum storage
d_cumsum.resize(num_bins);
// --- Find the end of each bin of values - Cumulative d_cumsum
thrust::counting_iterator<int> search_begin(0);
thrust::upper_bound(d_code.begin(), d_code.end(), search_begin, search_begin + num_bins, d_cumsum.begin());
// --- Compute the histogram by taking differences of the cumulative d_cumsum
//thrust::device_vector<int> d_counting(num_bins);
//thrust::adjacent_difference(d_cumsum.begin(), d_cumsum.end(), d_counting.begin());
#endif
printf("Timing GPU = %f\n", timerGPU.GetCounter());
#ifdef VERBOSE
thrust::host_vector<int> h_counting(d_counting);
printf("After\n");
for (int k = 0; k < N; k++) printf("code = %i\n", h_code[k]);
#ifndef NO_HISTOGRAM
thrust::host_vector<int> h_cumsum(d_cumsum);
printf("\nCounting\n");
for (int k = 0; k < num_bins; k++) printf("element = %i; counting = %i; cumsum = %i\n", k, h_counting[k], h_cumsum[k]);
#else
thrust::host_vector<int> h_code_unique(d_code_unique);
printf("\nCounting\n");
for (int k = 0; k < num_bins; k++) printf("element = %i; counting = %i\n", h_code_unique[k], h_counting[k]);
#endif
#endif
}
The first approach has proven to be the fastest. On an NVIDIA GTX 960 card, I measured the following timings for N = 1048576 array elements:
First approach: 2.35ms
First approach without thrust::adjacent_difference: 1.52ms
Second approach: 4.67ms
Please, note that there is no strict need to calculate the adjacent difference explicitly, since this operation can be manually done during a kernel processing, if needed.
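The first approach can be modeled on the host as follows. This is a sketch in plain C++, assuming the input is already sorted and its values lie in [0, num_bins); histogram_sorted is a hypothetical name, and std::upper_bound and std::adjacent_difference stand in for their Thrust counterparts.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Hypothetical host-side model of the upper_bound histogram.
// `sorted` must be in ascending order with values in [0, num_bins).
std::vector<int> histogram_sorted(const std::vector<int>& sorted, int num_bins) {
    // cumsum[b] = number of elements <= b (the end of bin b in the sorted data)
    std::vector<int> cumsum(num_bins);
    for (int b = 0; b < num_bins; ++b)
        cumsum[b] = static_cast<int>(
            std::upper_bound(sorted.begin(), sorted.end(), b) - sorted.begin());

    // The per-bin counts are the differences of consecutive cumulative counts.
    std::vector<int> counts(num_bins);
    std::adjacent_difference(cumsum.begin(), cumsum.end(), counts.begin());
    return counts;
}
```

For the sorted input 0 0 1 3 3 3 with 4 bins, the cumulative counts are 2 3 3 6 and the resulting histogram is 2 1 0 3.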
As others have said, you can use the sort & reduce_by_key approach to count frequencies. In my case, I needed to get mode of an array (maximum frequency/occurrence) so here is my solution:
1 - First, we create two new arrays, one containing a copy of input data and another filled with ones to later reduce it (sum):
// Input: [1 3 3 3 2 2 3]
// *(Temp) dev_keys: [1 3 3 3 2 2 3]
// *(Temp) dev_ones: [1 1 1 1 1 1 1]
// Copy input data
thrust::device_vector<int> dev_keys(myptr, myptr+size);
// Fill an array with ones
thrust::device_vector<int> dev_ones(size);
thrust::fill(dev_ones.begin(), dev_ones.end(), 1);
2 - Then, we sort the keys since the reduce_by_key function needs the array to be sorted.
// Sort keys (see below why)
thrust::sort(dev_keys.begin(), dev_keys.end());
3 - Later, we create two output vectors, for the (unique) keys and their frequencies:
thrust::device_vector<int> output_keys(size);
thrust::device_vector<int> output_freqs(size);
4 - Finally, we perform the reduction by key:
// Reduce contiguous keys: [1 3 3 3 2 2 3] => [1 3 2 1] Vs. [1 3 3 3 3 2 2] => [1 4 2]
thrust::pair<thrust::device_vector<int>::iterator, thrust::device_vector<int>::iterator> new_end;
new_end = thrust::reduce_by_key(dev_keys.begin(), dev_keys.end(), dev_ones.begin(), output_keys.begin(), output_freqs.begin());
5 - ...and if we want, we can get the most frequent element
// Get most frequent element
// Get index of the maximum frequency
int num_keys = new_end.first - output_keys.begin();
thrust::device_vector<int>::iterator iter = thrust::max_element(output_freqs.begin(), output_freqs.begin() + num_keys);
unsigned int index = iter - output_freqs.begin();
int most_frequent_key = output_keys[index];
int most_frequent_val = output_freqs[index]; // Frequencies