Real scaled Sparse matrix vector multiplication in Cusp?

Real scaled Sparse matrix vector multiplication in Cusp? - cuda

In cusp, there is a multiply to calculate spmv(sparse matrix vector multiplication) that takes a reduce and a combine:
template <typename LinearOperator,
typename MatrixOrVector1,
typename MatrixOrVector2,
typename UnaryFunction,
typename BinaryFunction1,
typename BinaryFunction2>
void multiply(const LinearOperator& A,
const MatrixOrVector1& B,
MatrixOrVector2& C,
UnaryFunction initialize,
BinaryFunction1 combine,
BinaryFunction2 reduce);
From the interface it seems like custom combine and reduce should be possible for any matrix/vector multiplication. I think cusp supports to use other combine and reduce function defined in thrust/functional.h besides multiplication and plus to calculate spmv. For example, can I use thrust::plus to replace multiplication the original combine function(i.e. multiplication)?
And I guess, this scaled spmv also support those sparse matrix in coo,csr,dia,hyb format.
However, I got a wrong answer when I tested the below example in a.cu whose matrix A was in coo format.
It used plus operator to combine. And I compiled it with cmd : nvcc a.cu -o a to .
#include <cusp/csr_matrix.h>
#include <cusp/monitor.h>
#include <cusp/multiply.h>
#include <cusp/print.h>
#include <cusp/krylov/cg.h>
int main(void)
{
// COO format in host memory
int host_I[13] = {0,0,1,1,2,2,2,3,3,3,4,5,5}; // COO row indices
int host_J[13] = {0,1,1,2,2,4,6,3,4,5,5,5,6}; // COO column indices
int host_V[13] = {1,1,1,1,1,1,1,1,1,1,1,1,1};
// x and y arrays in host memory
int host_x[7] = {1,1,1,1,1,1,1};
int host_y[6] = {0,0,0,0,0,0};
// allocate device memory for COO format
int * device_I;
cudaMalloc(&device_I, 13 * sizeof(int));
int * device_J;
cudaMalloc(&device_J, 13 * sizeof(int));
int * device_V;
cudaMalloc(&device_V, 13 * sizeof(int));
// allocate device memory for x and y arrays
int * device_x;
cudaMalloc(&device_x, 7 * sizeof(int));
int * device_y;
cudaMalloc(&device_y, 6 * sizeof(int));
// copy raw data from host to device
cudaMemcpy(device_I, host_I, 13 * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(device_J, host_J, 13 * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(device_V, host_V, 13 * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(device_x, host_x, 7 * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(device_y, host_y, 6 * sizeof(int), cudaMemcpyHostToDevice);
// matrices and vectors now reside on the device
// *NOTE* raw pointers must be wrapped with thrust::device_ptr!
thrust::device_ptr<int> wrapped_device_I(device_I);
thrust::device_ptr<int> wrapped_device_J(device_J);
thrust::device_ptr<int> wrapped_device_V(device_V);
thrust::device_ptr<int> wrapped_device_x(device_x);
thrust::device_ptr<int> wrapped_device_y(device_y);
// use array1d_view to wrap the individual arrays
typedef typename cusp::array1d_view< thrust::device_ptr<int> > DeviceIndexArrayView;
typedef typename cusp::array1d_view< thrust::device_ptr<int> > DeviceValueArrayView;
DeviceIndexArrayView row_indices (wrapped_device_I, wrapped_device_I + 13);
DeviceIndexArrayView column_indices(wrapped_device_J, wrapped_device_J + 13);
DeviceValueArrayView values (wrapped_device_V, wrapped_device_V + 13);
DeviceValueArrayView x (wrapped_device_x, wrapped_device_x + 7);
DeviceValueArrayView y (wrapped_device_y, wrapped_device_y + 6);
// combine the three array1d_views into a coo_matrix_view
typedef cusp::coo_matrix_view<DeviceIndexArrayView,
DeviceIndexArrayView,
DeviceValueArrayView> DeviceView;
// construct a coo_matrix_view from the array1d_views
DeviceView A(6, 7, 13, row_indices, column_indices, values);
std::cout << "\ndevice coo_matrix_view" << std::endl;
cusp::print(A);
cusp::constant_functor<int> initialize;
thrust::plus<int> combine;
thrust::plus<int> reduce;
cusp::multiply(A , x , y , initialize, combine, reduce);
std::cout << "\nx array" << std::endl;
cusp::print(x);
std::cout << "\n y array, y = A * x" << std::endl;
cusp::print(y);
cudaMemcpy(host_y, device_y, 6 * sizeof(int), cudaMemcpyDeviceToHost);
// free device arrays
cudaFree(device_I);
cudaFree(device_J);
cudaFree(device_V);
cudaFree(device_x);
cudaFree(device_y);
return 0;
}
And I got the below answer.
device coo_matrix_view
sparse matrix <6, 7> with 13 entries
0 0 (1)
0 1 (1)
1 1 (1)
1 2 (1)
2 2 (1)
2 4 (1)
2 6 (1)
3 3 (1)
3 4 (1)
3 5 (1)
4 5 (1)
5 5 (1)
5 6 (1)
x array
array1d <7>
(1)
(1)
(1)
(1)
(1)
(1)
(1)
y array, y = A * x
array1d <6>
(4)
(4)
(6)
(6)
(2)
(631)
The vector y I got is strange, I think the correct answer y should be:
[9,
9,
10,
10,
8,
9]
So I do not sure that whether such replacement of combine and reduce can be adapted to other sparse matrix format, like coo. Or maybe the code I wrote above is incorrect to call multiply.
Can you give me some help? Any info will help.
Thank you!

From a very brief reading of the code and instrumentation of your example, this seems to be something badly broken in CUSP causing the problem for this usage case. The code only appears to accidentally work correctly for the case where the combine operator is multiplication because the spurious operations it performs with zero elements do not effect the reduction operation (ie. it just sums a lot of additional zeros).

Related

Is DeviceToDevice copy in a CUDA program needed?

I am doing the following two operations:
Addition of two array => a + b = AddResult
Multiply of two arrays => AddResult * a = MultiplyResult
In the above logic, the AddResult is an intermediate result and is used in the next mupltiplication operation as input.
#define N 4096 // size of array
__global__ void add(const int* a, const int* b, int* c)
{
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid < N)
{
c[tid] = a[tid] + b[tid];
}
}
__global__ void multiply(const int* a, const int* b, int* c)
{
int tid = blockIdx.x * blockDim.x + threadIdx.x;
if (tid < N)
{
c[tid] = a[tid] * b[tid];
}
}
int main()
{
int T = 1024, B = 4; // threads per block and blocks per grid
int a[N], b[N], c[N], d[N], e[N];
int* dev_a, * dev_b, * dev_AddResult, * dev_Temp, * dev_MultiplyResult;
cudaMalloc((void**)&dev_a, N * sizeof(int));
cudaMalloc((void**)&dev_b, N * sizeof(int));
cudaMalloc((void**)&dev_AddResult, N * sizeof(int));
cudaMalloc((void**)&dev_Temp, N * sizeof(int));
cudaMalloc((void**)&dev_MultiplyResult, N * sizeof(int));
for (int i = 0; i < N; i++)
{
// load arrays with some numbers
a[i] = i;
b[i] = i * 1;
}
cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_AddResult, c, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_Temp, d, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_MultiplyResult, e, N * sizeof(int), cudaMemcpyHostToDevice);
//ADD
add << <B, T >> > (dev_a, dev_b, dev_AddResult);
cudaDeviceSynchronize();
//Multiply
cudaMemcpy(dev_Temp, dev_AddResult, N * sizeof(int), cudaMemcpyDeviceToDevice); //<---------DO I REALLY NEED THIS?
multiply << <B, T >> > (dev_a, dev_Temp, dev_MultiplyResult);
//multiply << <B, T >> > (dev_a, dev_AddResult, dev_MultiplyResult);
//Copy Final Results D to H
cudaMemcpy(e, dev_MultiplyResult, N * sizeof(int), cudaMemcpyDeviceToHost);
for (int i = 0; i < N; i++)
{
printf("(%d+%d)*%d=%d\n", a[i], b[i], a[i], e[i]);
}
// clean up
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_AddResult);
cudaFree(dev_Temp);
cudaFree(dev_MultiplyResult);
return 0;
}
In the above sample code, I am transferring the Addition results (i.e. dev_AddResult) to another device array (i.e. dev_Temp) to perform the multiplication operation.
QUESTION: Since the Addition results array (i.e. dev_AddResult) is already on the GPU device, do I really need to transfer it to another array? I have already tried to execute the next kernel by directly providing dev_AddResult as input and it produced the same results. Is there any risk involved in directly passing the output of one kernel as the input of the next kernel? Any best practices to follow?

Yes, for the case you have shown, you can use the "output" of one kernel as the "input" to the next, without any copying. You've already done that and confirmed it works, so I will dispense with any example. The changes are trivial anyway - eliminate the intervening cudaMemcpy operation, and use the same dev_AddResult pointer in place of the dev_Temp pointer on your multiply kernel invocation.
Regarding "risks" I'm not aware of any for the example you have given. Moving away from that example to possibly more general usage, you would want to make sure that the add output calculations are finished before being used somewhere else.
Your example already does this, redundantly, using at least 2 mechanisms:
intervening cudaDeviceSynchronize() - this forces the previously issued work to complete
stream semantics - one rule of stream semantics is that work issued into a particular stream will execute in issue order. Item B issued into stream X, will not begin until the previously issued item A into stream X has completed.
So you don't really need the cudaDeviceSynchronize() in this case. It isn't "hurting" anything from a functionality perspective, but it is probably adding a few microseconds to the overall execution time.
More generally, if you had issued your add and multiply kernel into separate streams, then CUDA provides no guarantees of execution order, even though you "issued" the multiply kernel after the add kernel.
In that case (not the one you have here) if you needed the multiply operation to use the previously computed add results, you would need to enforce that somehow (enforce the completion of the add kernel before the multiply kernel). You have already shown one method to do that here, using a synchronize call.

How to use CUDA tex1DFetch with cudaTextureObject_t?

I was working with texture references when I noticed they were deprecated, I tried to update my test function to work with the 'new' bindless texture objects with tex1Dfetch but was not able to produce the same results.
I'm currently exploring the use of texture memory to speed up my aho-corasick implementation; I was able to get tex1D() working with texture references, however, I noticed they were deprecated and decided to use texture objects instead.
I'm getting some immensely weird behaviour with the kernels when I try to use the results in any way; I can do results[tidx] = tidx; without any issues, but results[tidx] = temp + 1; only ever returns the value of temp not temp * 3 or any other numerical test involving temp.
I can see no logical reason for this behaviour, and the documentation examples look similar enough that I can't see where I've gone wrong.
I've already read CUDA tex1Dfetch() wrong behaviour and New CUDA Texture Object — getting wrong data in 2D case but neither seem related to the issue I am having.
Just in case it makes a difference; I am am using CUDA release 10.0, V10.0.130 with an Nvidia GTX 980ti.
#include <iostream>
__global__ void test(cudaTextureObject_t tex ,int* results){
int tidx = threadIdx.y * blockDim.x + threadIdx.x;
unsigned temp = tex1Dfetch<unsigned>(tex, threadIdx.x);
results[tidx] = temp * 3;
}
int main(){
int *host_arr;
const int host_arr_size = 8;
// Create and populate host array
std::cout << "Host:" << std::endl;
cudaMallocHost(&host_arr, host_arr_size*sizeof(int));
for (int i = 0; i < host_arr_size; ++i){
host_arr[i] = i * 2;
std::cout << host_arr[i] << std::endl;
}
// Create resource description
struct cudaResourceDesc resDesc;
resDesc.resType = cudaResourceTypeLinear;
resDesc.res.linear.devPtr = &host_arr;
resDesc.res.linear.sizeInBytes = host_arr_size*sizeof(unsigned);
resDesc.res.linear.desc = cudaCreateChannelDesc<unsigned>();
// Create texture description
struct cudaTextureDesc texDesc;
texDesc.readMode = cudaReadModeElementType;
// Create texture
cudaTextureObject_t tex;
cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
// Allocate results array
int * result_arr;
cudaMalloc(&result_arr, host_arr_size*sizeof(unsigned));
// launch test kernel
test<<<1, host_arr_size>>>(tex, result_arr);
// fetch results
std::cout << "Device:" << std::endl;
cudaMemcpy(host_arr, result_arr, host_arr_size*sizeof(unsigned), cudaMemcpyDeviceToHost);
// print results
for (int i = 0; i < host_arr_size; ++i){
std::cout << host_arr[i] << std::endl;
}
// Tidy Up
cudaDestroyTextureObject(tex);
cudaFreeHost(host_arr);
cudaFree(result_arr);
}
I expected the above to work similarly to the below (which does work):
texture<int, 1, cudaReadModeElementType> tex_ref;
cudaArray* cuda_array;
__global__ void test(int* results){
const int tidx = threadIdx.x;
results[tidx] = tex1D(tex_ref, tidx) * 3;
}
int main(){
int *host_arr;
int host_arr_size = 8;
// Create and populate host array
cudaMallocHost((void**)&host_arr, host_arr_size * sizeof(int));
for (int i = 0; i < host_arr_size; ++i){
host_arr[i] = i * 2;
std::cout << host_arr[i] << std::endl;
}
// bind to texture
cudaChannelFormatDesc cuDesc = cudaCreateChannelDesc <int >();
cudaMallocArray(&cuda_array, &cuDesc, host_arr_size);
cudaMemcpyToArray(cuda_array, 0, 0, host_arr , host_arr_size * sizeof(int), cudaMemcpyHostToDevice);
cudaBindTextureToArray(tex_ref , cuda_array);
// Allocate results array
int * result_arr;
cudaMalloc((void**)&result_arr, host_arr_size*sizeof(int));
// launch kernel
test<<<1, host_arr_size>>>(result_arr);
// fetch results
cudaMemcpy(host_arr, result_arr, host_arr_size * sizeof(int), cudaMemcpyDeviceToHost);
// print results
for (int i = 0; i < host_arr_size; ++i){
std::cout << host_arr[i] << std::endl;
}
// Tidy Up
cudaUnbindTexture(tex_ref);
cudaFreeHost(host_arr);
cudaFreeArray(cuda_array);
cudaFree(result_arr);
}
Expected results:
Host:
0
2
4
6
8
10
12
14
Device:
0
6
12
18
24
30
36
42
Actual results:
Host:
0
2
4
6
8
10
12
14
Device:
0
2
4
6
8
10
12
14
Does anyone know what on earth is going wrong?

CUDA API function calls return error codes. You want to check these error codes. Especially when something is clearly going wrong somewhere…
You use the the same array to store the initial array data as well as to receive the result from the device. Your kernel launch fails with an illegal address error because you do not have a valid texture object. You do not have a valid texture object because the creation of your texture object failed. The first API call right after the kernel launch is the cudaMemcpy() to get the results back. Since there was an error during the kernel launch, cudaMemcpy() will fail, returning the most recent error instead of performing the copy. As a result, the contents of your host_arr buffer are unchanged and you just end up displaying the original input data again.
The reson why creation of your texture object failed is explained in the documentation (emphasis mine):
If cudaResourceDesc::resType is set to cudaResourceTypeLinear, cudaResourceDesc::res::linear::devPtr must be set to a valid device pointer, that is aligned to cudaDeviceProp::textureAlignment. […]
A texture object cannot reference host memory. The problem in your code lies here:
resDesc.res.linear.devPtr = &host_arr;
You need to allocate a buffer in decive memory, e.g., using cudaMalloc(), copy your data there, and create a texture object that refers to that device buffer.
Furthermore, your texDesc is not initialized properly. In your case, it should be sufficient to just zero-initialize it:
struct cudaTextureDesc texDesc = {};

4 steps:
declare
texture<unsigned char,1,cudaReadmodeElementType> tex1;
bind
cudaBindTexture(0,tex1,dev_A);
fetch/read via index
tex1Dfetch(tex1,2);
unbind
cudaUnbindTexture(tex1)；

Cuda: Copy 1D Array From CPU to GPU

I am newbie to Cuda, trying to copy array from Host to Device via cudaMemcpy(...)
However,the data passed to GPU seems to be totally different (for cost: totally wrong, for G: wrong after index of 5)
My data is a malloc array (written in C) of size 25 for example,
I tried to copy through the following way (MAX = 5):
Declaration:
int *cost, int* G
int *dev_cost, *dev_G;
Allocation:
cost = (int*)malloc(MAX* MAX * sizeof(int));
G = (int*)malloc(MAX* MAX* sizeof(int));
cudaMalloc((void**)&dev_cost, MAX*MAX);
cudaMalloc((void**)&dev_G, MAX*MAX);
Data transfer:
cudaMemcpy(dev_cost, cost, MAX*MAX, cudaMemcpyHostToDevice);
cudaMemcpy(dev_G, G, MAX*MAX, cudaMemcpyHostToDevice);
Kernel Trigger:
assignCost<<<1,MAX*MAX>>>(dev_G,dev_cost);
Kernel Function:
__global__ void assignCost(int *G, int *cost)
{
int tid = threadIdx.x + blockDim.x*blockIdx.x;
printf("cost[%d]: %d G[%d] = %d\n", tid, cost[tid], tid, G[tid]);
if(tid<MAX*MAX)
{
if (G[tid] == 0)
cost[tid] = INT_MAX;
else
cost[tid] = G[tid];
}
}
Is there anything wrong with my approach? If then, how should i modify?

cudaMemcpy(dev_cost, cost, MAX*MAX*sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_G, G, MAX*MAX*sizeof(int), cudaMemcpyHostToDevice);
The 3rd argument for cudaMemcpy is the count in bytes. As you have MAX*MAX integers and each integer has a size sizeof(int) bytes, replace MAX*MAX as MAX*MAX*sizeof(int)

Converting thrust::iterators to and from raw pointers

I want to use Thrust library to calculate prefix sum of device array in CUDA.
My array is allocated with cudaMalloc(). My requirement is as follows:
main()
{
Launch kernel 1 on data allocated through cudaMalloc()
// This kernel will poplulate some data d.
Use thrust to calculate prefix sum of d.
Launch kernel 2 on prefix sum.
}
I want to use Thrust somewhere between my kernels so I need method to convert pointers to device iterators and back.What is wrong in following code?
int main()
{
int *a;
cudaMalloc((void**)&a,N*sizeof(int));
thrust::device_ptr<int> d=thrust::device_pointer_cast(a);
thrust::device_vector<int> v(N);
thrust::exclusive_scan(a,a+N,v);
return 0;
}

A complete working example from your latest edit would look like this:
#include <thrust/device_ptr.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/fill.h>
#include <thrust/copy.h>
#include <cstdio>
int main()
{
const int N = 16;
int * a;
cudaMalloc((void**)&a, N*sizeof(int));
thrust::device_ptr<int> d = thrust::device_pointer_cast(a);
thrust::fill(d, d+N, 2);
thrust::device_vector<int> v(N);
thrust::exclusive_scan(d, d+N, v.begin());
int v_[N];
thrust::copy(v.begin(), v.end(), v_);
for(int i=0; i<N; i++)
printf("%d %d\n", i, v_[i]);
return 0;
}
The things you got wrong:
N not defined anywhere
passing the raw device pointer a rather than the device_ptr d as the input iterator to exclusive_scan
passing the device_vector v to exclusive_scan rather than the appropriate iterator v.begin()
Attention to detail was all that is lacking to make this work. And work it does:
$ nvcc -arch=sm_12 -o thrust_kivekset thrust_kivekset.cu
$ ./thrust_kivekset
0 0
1 2
2 4
3 6
4 8
5 10
6 12
7 14
8 16
9 18
10 20
11 22
12 24
13 26
14 28
15 30
Edit:
thrust::device_vector.data() will return a thrust::device_ptr which points to the first element of the vector. thrust::device_ptr.get() will return a raw device pointer. Therefore
cudaMemcpy(v_, v.data().get(), N*sizeof(int), cudaMemcpyDeviceToHost);
and
thrust::copy(v, v+N, v_);
are functionally equivalent in this example.

Convert your raw pointer obtained from cudaMalloc() to a thrust::device_ptr using thrust::device_pointer_cast. Here's an example from the Thrust docs:
#include <thrust/device_ptr.h>
#include <thrust/fill.h>
#include <cuda.h>
int main(void)
{
size_t N = 10;
// obtain raw pointer to device memory
int * raw_ptr;
cudaMalloc((void **) &raw_ptr, N * sizeof(int));
// wrap raw pointer with a device_ptr
thrust::device_ptr<int> dev_ptr = thrust::device_pointer_cast(raw_ptr);
// use device_ptr in Thrust algorithms
thrust::fill(dev_ptr, dev_ptr + N, (int) 0);
// access device memory transparently through device_ptr
dev_ptr[0] = 1;
// free memory
cudaFree(raw_ptr);
return 0;
}
Use thrust::inclusive_scan or thrust::exclusive_scan to compute the prefix sum.
http://code.google.com/p/thrust/wiki/QuickStartGuide#Prefix-Sums

Counting occurrences of numbers in a CUDA array

I have an array of unsigned integers stored on the GPU with CUDA (typically 1000000 elements). I would like to count the occurrence of every number in the array. There are only a few distinct numbers (about 10), but these numbers can span from 1 to 1000000. About 9/10th of the numbers are 0, I don't need the count of them. The result looks something like this:
58458 -> 1000 occurrences
15 -> 412 occurrences
I have an implementation using atomicAdds, but it is too slow (a lot of threads write to the same address). Does someone know of a fast/efficient method?

You can implement a histogram by first sorting the numbers, and then doing a keyed reduction.
The most straightforward method would be to use thrust::sort and then thrust::reduce_by_key. It's also often much faster than ad hoc binning based on atomics. Here's an example.

I suppose you can find help in the CUDA examples, specifically the histogram examples. They are part of the GPU computing SDK.
You can find it here http://developer.nvidia.com/cuda-cc-sdk-code-samples#histogram. They even have a whitepaper explaining the algorithms.

I'm comparing two approaches suggested at the duplicate question thrust count occurence, namely,
Using thrust::counting_iterator and thrust::upper_bound, following the histogram Thrust example;
Using thrust::unique_copy and thrust::upper_bound.
Below, please find a fully worked example.
#include <time.h> // --- time
#include <stdlib.h> // --- srand, rand
#include <iostream>
#include <thrust\host_vector.h>
#include <thrust\device_vector.h>
#include <thrust\sort.h>
#include <thrust\iterator\zip_iterator.h>
#include <thrust\unique.h>
#include <thrust/binary_search.h>
#include <thrust\adjacent_difference.h>
#include "Utilities.cuh"
#include "TimingGPU.cuh"
//#define VERBOSE
#define NO_HISTOGRAM
/********/
/* MAIN */
/********/
int main() {
const int N = 1048576;
//const int N = 20;
//const int N = 128;
TimingGPU timerGPU;
// --- Initialize random seed
srand(time(NULL));
thrust::host_vector<int> h_code(N);
for (int k = 0; k < N; k++) {
// --- Generate random numbers between 0 and 9
h_code[k] = (rand() % 10);
}
thrust::device_vector<int> d_code(h_code);
//thrust::device_vector<unsigned int> d_counting(N);
thrust::sort(d_code.begin(), d_code.end());
h_code = d_code;
timerGPU.StartCounter();
#ifdef NO_HISTOGRAM
// --- The number of d_cumsum bins is equal to the maximum value plus one
int num_bins = d_code.back() + 1;
thrust::device_vector<int> d_code_unique(num_bins);
thrust::unique_copy(d_code.begin(), d_code.end(), d_code_unique.begin());
thrust::device_vector<int> d_counting(num_bins);
thrust::upper_bound(d_code.begin(), d_code.end(), d_code_unique.begin(), d_code_unique.end(), d_counting.begin());
#else
thrust::device_vector<int> d_cumsum;
// --- The number of d_cumsum bins is equal to the maximum value plus one
int num_bins = d_code.back() + 1;
// --- Resize d_cumsum storage
d_cumsum.resize(num_bins);
// --- Find the end of each bin of values - Cumulative d_cumsum
thrust::counting_iterator<int> search_begin(0);
thrust::upper_bound(d_code.begin(), d_code.end(), search_begin, search_begin + num_bins, d_cumsum.begin());
// --- Compute the histogram by taking differences of the cumulative d_cumsum
//thrust::device_vector<int> d_counting(num_bins);
//thrust::adjacent_difference(d_cumsum.begin(), d_cumsum.end(), d_counting.begin());
#endif
printf("Timing GPU = %f\n", timerGPU.GetCounter());
#ifdef VERBOSE
thrust::host_vector<int> h_counting(d_counting);
printf("After\n");
for (int k = 0; k < N; k++) printf("code = %i\n", h_code[k]);
#ifndef NO_HISTOGRAM
thrust::host_vector<int> h_cumsum(d_cumsum);
printf("\nCounting\n");
for (int k = 0; k < num_bins; k++) printf("element = %i; counting = %i; cumsum = %i\n", k, h_counting[k], h_cumsum[k]);
#else
thrust::host_vector<int> h_code_unique(d_code_unique);
printf("\nCounting\n");
for (int k = 0; k < N; k++) printf("element = %i; counting = %i\n", h_code_unique[k], h_counting[k]);
#endif
#endif
}
The first approach has shown to be the fastest. On an NVIDIA GTX 960 card, I have had the following timings for a number of N = 1048576 array elements:
First approach: 2.35ms
First approach without thrust::adjacent_difference: 1.52
Second approach: 4.67ms
Please, note that there is no strict need to calculate the adjacent difference explicitly, since this operation can be manually done during a kernel processing, if needed.

As others have said, you can use the sort & reduce_by_key approach to count frequencies. In my case, I needed to get mode of an array (maximum frequency/occurrence) so here is my solution:
1 - First, we create two new arrays, one containing a copy of input data and another filled with ones to later reduce it (sum):
// Input: [1 3 3 3 2 2 3]
// *(Temp) dev_keys: [1 3 3 3 2 2 3]
// *(Temp) dev_ones: [1 1 1 1 1 1 1]
// Copy input data
thrust::device_vector<int> dev_keys(myptr, myptr+size);
// Fill an array with ones
thrust::fill(dev_ones.begin(), dev_ones.end(), 1);
2 - Then, we sort the keys since the reduce_by_key function needs the array to be sorted.
// Sort keys (see below why)
thrust::sort(dev_keys.begin(), dev_keys.end());
3 - Later, we create two output vectors, for the (unique) keys and their frequencies:
thrust::device_vector<int> output_keys(N);
thrust::device_vector<int> output_freqs(N);
4 - Finally, we perform the reduction by key:
// Reduce contiguous keys: [1 3 3 3 2 2 3] => [1 3 2 1] Vs. [1 3 3 3 3 2 2] => [1 4 2]
thrust::pair<thrust::device_vector<int>::iterator, thrust::device_vector<int>::iterator> new_end;
new_end = thrust::reduce_by_key(dev_keys.begin(), dev_keys.end(), dev_ones.begin(), output_keys.begin(), output_freqs.begin());
5 - ...and if we want, we can get the most frequent element
// Get most frequent element
// Get index of the maximum frequency
int num_keys = new_end.first - output_keys.begin();
thrust::device_vector<int>::iterator iter = thrust::max_element(output_freqs.begin(), output_freqs.begin() + num_keys);
unsigned int index = iter - output_freqs.begin();
int most_frequent_key = output_keys[index];
int most_frequent_val = output_freqs[index]; // Frequencies

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008