CUDA: Getting max value and its index in an array

I have several blocks, where each block works on a separate part of an integer array. For example, block one handles array[0] to array[9] and block two handles array[10] to array[19].
What is the best way to get the index of the max value of the array for each block?
For example, block one, a[0] to a[9], has the following values:
5 10 2 3 4 34 56 3 9 10
So 56 is the largest value, at index 6.
I cannot use shared memory because the array may be very big and won't fit. Are there any libraries that allow me to do this fast?
I know about the reduction algorithm, but I think my case is different because I want to get the index of the largest element.

If I understand correctly, you want the index of the maximum value in the array A.
If so, I would suggest using the Thrust library.
Here is how you would do it:
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/tuple.h>
#include <thrust/reduce.h>
#include <thrust/sequence.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <iostream>
using namespace thrust;
// return the bigger of two (value, index) tuples; tuples compare lexicographically
template <class T>
struct bigger_tuple {
    __device__ __host__
    tuple<T,int> operator()(const tuple<T,int> &a, const tuple<T,int> &b)
    {
        if (a > b) return a;
        else return b;
    }
};
template <class T>
int max_index(device_vector<T>& vec) {
    // create implicit index sequence [0, 1, 2, ... )
    counting_iterator<int> begin(0); counting_iterator<int> end(vec.size());
    tuple<T,int> init(vec[0],0);
    tuple<T,int> largest;
    // reduce over (value, index) pairs so the index travels with the value
    largest = thrust::reduce(thrust::make_zip_iterator(thrust::make_tuple(vec.begin(), begin)),
                             thrust::make_zip_iterator(thrust::make_tuple(vec.end(), end)),
                             init, bigger_tuple<T>());
    return thrust::get<1>(largest);
}
int main(){
    thrust::host_vector<int> h_vec(1024);
    thrust::sequence(h_vec.begin(), h_vec.end()); // values = indices
    // transfer data to the device
    thrust::device_vector<int> d_vec = h_vec;
    int index = max_index(d_vec);
    std::cout << "Max index is:" << index <<std::endl;
    std::cout << "Value is: " << h_vec[index] <<std::endl;
    return 0;
}

This will not benefit the original poster, but for those who came to this page looking for an answer: I would second the recommendation to use Thrust, which already has a function, thrust::max_element, that does exactly that. It returns an iterator to the largest element, and subtracting the begin iterator from it gives the index. min_element and minmax_element functions are also provided. See the Thrust documentation for details.
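Since thrust::max_element returns an iterator rather than an integer, the index is the distance from the start of the range. The C++ standard library uses the same idiom, so here is a host-only sketch with std::max_element. The helper name argmax is made up for illustration; with Thrust the analogous expression would be thrust::max_element(d_vec.begin(), d_vec.end()) - d_vec.begin().

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical helper: index of the first largest element of a non-empty vector.
std::size_t argmax(const std::vector<int>& v) {
    assert(!v.empty());
    // max_element yields an iterator; its distance from begin() is the index.
    auto it = std::max_element(v.begin(), v.end());
    return static_cast<std::size_t>(std::distance(v.begin(), it));
}
```

For the values in the question (5 10 2 3 4 34 56 3 9 10), this returns 6.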

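As well as the suggestion to use Thrust, you could also use the cuBLAS cublasIsamax function. Note that it operates on float data, selects the element of largest absolute value, and returns a 1-based index.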

The size of your array in comparison to shared memory is almost irrelevant, since the number of threads in each block is the limiting factor rather than the size of the array. One solution is to have each thread block work on a portion of the array the same size as the thread block. That is, if you have 512 threads, then block n will be looking at array[ 512*n ] through array[ 512*n + 511 ]. Each block does a reduction to find the highest member in that portion of the array. Then you bring the max of each section back to the host and do a simple linear search to locate the highest value in the overall array. Each reduction on the GPU reduces the linear search by a factor of 512. Depending on the size of the array, you might want to do more reductions before you bring the data back. (If your array is 3*512^10 in size, you might want to do 10 reductions on the GPU and have the host search through the 3 remaining data points.)
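The two-stage scheme can be modeled on the host to see how the index survives both stages. This is a plain C++ sketch, not GPU code: each chunk stands in for the slice handled by one thread block, and the names MaxLoc, chunk_maxima and final_max are invented for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical pair type: a value together with its index in the full array.
struct MaxLoc {
    int value;
    std::size_t index;
};

// Stage 1: one partial result per chunk, mimicking one thread block per chunk.
std::vector<MaxLoc> chunk_maxima(const std::vector<int>& a, std::size_t chunk) {
    assert(chunk > 0);
    std::vector<MaxLoc> partial;
    for (std::size_t start = 0; start < a.size(); start += chunk) {
        std::size_t stop = std::min(start + chunk, a.size());
        MaxLoc best{a[start], start};
        for (std::size_t i = start + 1; i < stop; ++i)
            if (a[i] > best.value) best = MaxLoc{a[i], i};
        partial.push_back(best);
    }
    return partial;
}

// Stage 2: the short linear search over the per-chunk results (the host step).
MaxLoc final_max(const std::vector<MaxLoc>& partial) {
    assert(!partial.empty());
    MaxLoc best = partial[0];
    for (const MaxLoc& p : partial)
        if (p.value > best.value) best = p;
    return best;
}
```

Carrying the global index alongside the value is what distinguishes this from a plain max reduction; the second stage never needs to look at the original array again.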

One thing to watch out for when doing a max-value-plus-index reduction is that the array may contain more than one element equal to the maximum, e.g. in your example there could be two or more values equal to 56. In that case the returned index is not unique and may differ from run to run, because the order in which threads execute on the GPU is not deterministic.
To get around this kind of problem you can use a unique ordering index such as threadid + threadsperblock * blockid, or else the element index location if that is unique. Then the max test is along these lines:
if (a > max_so_far || (a == max_so_far && order_a > order_max_so_far))
{
    max_so_far = a;
    index_max_so_far = index_a;
    order_max_so_far = order_a;
}
(index and order can be the same variable, depending on the application.)
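To illustrate why the extra ordering key makes the result deterministic, here is a small host-side C++ sketch that applies the same rule to candidates visited in two different orders. The names Candidate, replaces and reduce_max are hypothetical, and the loop merely stands in for whatever order the GPU threads happen to run in.

```cpp
#include <cassert>
#include <vector>

// Hypothetical candidate: a value plus a unique ordering key (e.g. its global index).
struct Candidate {
    int value;
    int order;
};

// True when b should replace a: the larger value wins, and on equal values
// the larger order wins (the same rule as the test above).
bool replaces(const Candidate& a, const Candidate& b) {
    return b.value > a.value || (b.value == a.value && b.order > a.order);
}

// Fold the candidates in whatever order they are given.
Candidate reduce_max(const std::vector<Candidate>& items) {
    assert(!items.empty());
    Candidate best = items[0];
    for (const Candidate& c : items)
        if (replaces(best, c)) best = c;
    return best;
}
```

With two candidates of value 56, the one with the higher order is selected whether the list is traversed forwards or backwards; without the tie-break the answer would depend on the visiting order.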

Related

Thrust: Stream compaction copying only first N valid elements

I have a const thrust vector of elements from which I would like to extract at most N elements that pass a predicate (in any order), where the thrust vector size and N are known at compile-time. In my specific case, my vector is 500k elements and N is 100k.
My initial thought was to use thrust::copy_if to get all elements that pass the predicate, then to use only the first N elements for my subsequent calculations. However, in that case I would have to allocate two vectors of 500k elements (one for the initial vector, and one for the output of copy_if) and I'd have to process every element.
As this is an operation I have to do many times and across several CUDA streams, I would like to know if there is a way to obtain the N output elements while minimizing the memory footprint required, and ideally, minimizing the number of elements that need to be processed (i.e. breaking the process once N valid elements have been found).
One possible method to perform a stream compaction operation is to perform a predicated prefix-sum followed by a conditional indexed copy. By breaking a "monolithic" operation into these 2 pieces, it becomes fairly easy to insert the desired limiting behavior on output size.
The prefix sum is a fairly involved operation. We will use thrust for that. The conditional indexed copy is fairly trivial, so we will write our own CUDA kernel for that, rather than try to wrestle with a thrust::copy_if operation to get the copy logic just right. This kernel is where we will insert the limiting behavior on the output size.
Here is a worked example:
$ cat t34.cu
#include <thrust/scan.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <iostream>
#include <iterator>
using namespace thrust::placeholders;
typedef int mt;
// conditional indexed copy: d = data, i = inclusive scan of the predicate, r = result
__global__ void my_copy(mt *d, int *i, mt *r, int limit, int size){
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    if (idx < size){
        // element 0 passed the predicate if the scan starts at 1
        if ((idx == 0) && (*i == 1) && (limit > 0))
            *r = *d;
        // element idx passed if the scan increased here; skip it once the limit is exceeded
        else if ((idx > 0) && (i[idx] > i[idx-1]) && (i[idx] <= limit)){
            r[i[idx]-1] = d[idx];}
    }
}
int main(){
    int rs = 3;
    mt d[] = {0, 1, 0, 2, 0, 3, 0, 4, 0, 5};
    int ds = sizeof(d)/sizeof(d[0]);
    thrust::device_vector<mt> data(d, d+ds);
    thrust::device_vector<int> idx(ds);
    thrust::device_vector<mt> result(rs);
    auto my_cmp = thrust::make_transform_iterator(data.begin(), 0+(_1>0));
    thrust::inclusive_scan(my_cmp, my_cmp+ds, idx.begin());
    my_copy<<<(ds+255)/256, 256>>>(thrust::raw_pointer_cast(data.data()), thrust::raw_pointer_cast(idx.data()), thrust::raw_pointer_cast(result.data()), rs, ds);
    thrust::host_vector<mt> h_result = result;
    thrust::copy_n(h_result.begin(), rs, std::ostream_iterator<mt>(std::cout, ","));
    std::cout << std::endl;
}
$ nvcc -std=c++14 -o t34 t34.cu -arch=sm_52
$ ./t34
1,2,3,
$
(CUDA 11.0, Fedora 29, GTX 960)
Note that this code is provided for demonstration purposes. You should not assume that it is defect-free or suitable for any particular purpose. Use it at your own risk.
A bit of study with a profiler will show that the thrust::inclusive_scan operation does perform a cudaMalloc and cudaFree operation "under the hood". So even though we have pulled most of the allocations "out into the open" here, thrust apparently still needs to perform a single temporary allocation (of unknown size) to support the scan operation.
Responding to a question in the comments below. To understand this: 0+(_1>0), there are two things to note:
1. The general syntax is using thrust::placeholders. This capability of thrust allows us to write simple unary or binary functions inline, avoiding the need to use lambdas or write separate functors.
2. The reason for the 0+ is as follows. If we simply used (_1>0), then thrust would use as its unary function a boolean test of the item returned by dereferencing the iterator, compared to zero. The result of that comparison is a boolean, and if we leave it that way, the prefix sum will ultimately be computed using boolean arithmetic, which we do not want. We want the result of the boolean greater-than test (i.e. true/false) to be converted to an integer, so that the subsequent prefix sum gets performed using integer arithmetic. Prepending the (_1>0) boolean test with 0+ accomplishes that.
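For readers who want to trace the scan-then-copy logic without a GPU, the following is a sequential C++ sketch of the same two steps. It is a CPU model only; the function name compact_limited is invented here, and the predicate (element > 0) is the one used in the worked example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Keep at most `limit` positive elements of `data`, in their original order.
std::vector<int> compact_limited(const std::vector<int>& data, std::size_t limit) {
    // Step 1: inclusive prefix sum of the predicate, computed in integer arithmetic.
    std::vector<std::size_t> scan(data.size());
    std::size_t running = 0;
    for (std::size_t i = 0; i < data.size(); ++i) {
        running += (data[i] > 0) ? 1 : 0;
        scan[i] = running;
    }
    // Step 2: conditional indexed copy. Element i passed the predicate exactly
    // when the scan increased at i; scan[i] - 1 is then its output slot.
    std::size_t kept = running < limit ? running : limit;
    std::vector<int> result(kept);
    for (std::size_t i = 0; i < data.size(); ++i) {
        std::size_t prev = (i == 0) ? 0 : scan[i - 1];
        if (scan[i] > prev && scan[i] <= limit) {
            assert(scan[i] - 1 < result.size());
            result[scan[i] - 1] = data[i];
        }
    }
    return result;
}
```

Each iteration of the second loop depends only on scan[i] and scan[i-1], which is why that step maps directly onto one GPU thread per element.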

Sequential operation in GPU implementation

I have to implement the following algorithm in GPU
for(int I = 0; I < 1000; I++){
VAR1[I+1] = VAR1[I] + VAR2[2*K+(I-1)];//K is a constant
}
Each iteration depends on the previous one, so parallelizing is difficult. I am not sure whether an atomic operation is valid here. What can I do?
EDIT:
VAR1 and VAR2 are both 1D arrays.
VAR1[0] = 1
This is in a category of problems called recurrence relations. Depending on the structure of the recurrence relation, there may exist closed form solutions that describe how to compute each element individually (i.e. in parallel, without recursion). One of the early seminal papers (on parallel computation) was Kogge and Stone, and there exist recipes and strategies for parallelizing specific forms.
Sometimes recurrence relations are so simple that we can identify a closed-form formula or algorithm with a little bit of "inspection". This short tutorial gives a little bit more treatment of this idea.
In your case, let's see if we can spot anything just by mapping out what the first few terms of VAR1 should look like, substituting previous terms into newer terms:
i     VAR1[i]
___________________
0     1
1     1 + VAR2[2K-1]
2     1 + VAR2[2K-1] + VAR2[2K]
3     1 + VAR2[2K-1] + VAR2[2K] + VAR2[2K+1]
4     1 + VAR2[2K-1] + VAR2[2K] + VAR2[2K+1] + VAR2[2K+2]
...
Hopefully what jumps out at you is that the VAR2[] terms above follow a pattern of a prefix sum.
This means one possible solution method could be given by:
VAR1[i] = 1+prefix_sum(VAR2[2K + (i-2)]) (for i > 0) notes:(1) (2)
VAR1[i] = 1 (for i = 0)
Now, a prefix sum can be done in parallel (this is not truly a fully independent operation, but it can be parallelized. I don't want to argue too much about terminology or purity here. I'm offering one possible method of parallelization for your stated problem, not the only way to do it.) To do a prefix sum in parallel on the GPU, I would use a library like CUB or Thrust. Or you can write your own although I wouldn't recommend it.
Notes:
(1) the use of -1 or -2 as an offset to i for the prefix sum may be dictated by your use of an inclusive or exclusive scan or prefix sum operation.
(2) VAR2 must be defined over an appropriate domain to make this sensible. However that requirement is implicit in your problem statement.
Here is a trivial worked example. In this case, since the VAR2 indexing term 2K+(I-1) just represents a fixed offset to I (2K-1), we are simply using an offset of 0 for demonstration purposes, so VAR2 is just a simple array over the same domain as VAR1. And I am defining VAR2 to just be an array of all 1, for demonstration purposes. The gpu parallel computation occurs in the VAR1 vector, the CPU equivalent computation is just computed on-the-fly in the cpu variable for validation purposes:
$ cat t1056.cu
#include <thrust/scan.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/transform.h>
#include <iostream>
const int dsize = 1000;
using namespace thrust::placeholders;
int main(){
    thrust::device_vector<int> VAR2(dsize, 1); // initialize VAR2 array to all 1's
    thrust::device_vector<int> VAR1(dsize);
    thrust::exclusive_scan(VAR2.begin(), VAR2.end(), VAR1.begin(), 0); // put prefix sum of VAR2 into VAR1
    thrust::transform(VAR1.begin(), VAR1.end(), VAR1.begin(), _1 + 1); // add 1 to every term
    int cpu = 1;
    for (int i = 1; i < dsize; i++){
        int gpu = VAR1[i];
        cpu += VAR2[i-1]; // with the zero offset used here, VAR1[i] = VAR1[i-1] + VAR2[i-1]
        if (cpu != gpu) {std::cout << "mismatch at: " << i << " was: " << gpu << " should be: " << cpu << std::endl; return 1;}
    }
    std::cout << "Success!" << std::endl;
    return 0;
}
$ nvcc -o t1056 t1056.cu
$ ./t1056
Success!
$
For an additional reference particular to the usage of scan operations to solve linear recurrence problems, refer to Blelloch's paper here section 1.4. This question/answer gives an example of how to implement the equation 1.5 in that paper for a more general first-order recurrence case. This question considers the second-order recurrence case.
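As a cross-check that the recurrence and the prefix-sum form agree when VAR2 is not constant and the offset is not zero, here is a host-only C++ sketch. The parameter offset stands for the constant 2K-1 from the question, and the function names are made up for this example. std::partial_sum is itself sequential; it is used only as a reference inclusive scan, where a GPU version would call a parallel scan from Thrust or CUB instead.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Sequential reference: VAR1[0] = 1, VAR1[i+1] = VAR1[i] + VAR2[offset + i].
std::vector<int> recurrence_loop(const std::vector<int>& var2, int offset, int n) {
    assert(offset >= 0 && offset + n <= static_cast<int>(var2.size()));
    std::vector<int> var1(n + 1);
    var1[0] = 1;
    for (int i = 0; i < n; ++i) var1[i + 1] = var1[i] + var2[offset + i];
    return var1;
}

// Same values expressed as 1 + (inclusive prefix sum of VAR2 starting at offset).
std::vector<int> recurrence_scan(const std::vector<int>& var2, int offset, int n) {
    assert(offset >= 0 && offset + n <= static_cast<int>(var2.size()));
    std::vector<int> var1(n + 1);
    var1[0] = 1;
    // Inclusive scan of VAR2[offset .. offset+n-1], written to var1[1 .. n].
    std::partial_sum(var2.begin() + offset, var2.begin() + offset + n, var1.begin() + 1);
    // Add the initial value VAR1[0] = 1 to every scanned term.
    for (int i = 1; i <= n; ++i) var1[i] += 1;
    return var1;
}
```

Once the problem is in the second form, only the scan carries a dependency between elements, and that is the part a library scan parallelizes.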

use of constant in cuda is not accessed in the kernel

In my CUDA code I am trying to use a structure and a __constant__ structure object. The value is assigned to the constant object using cudaMemcpyToSymbol, but the constant values are not accessed. I know this is not the intended use of constant memory, since each thread accesses a different value and cannot take advantage of the memory broadcast to a half warp, but in some situations I need it this way.
#include <iostream>
#include <stdio.h>
#include <cuda.h>
using namespace std;
struct CDistance
{
    int Magnitude;
    int Direction;
};
__constant__ CDistance *c_daSTLDistance;
__global__ static void CalcSTLDistance_Kernel(CDistance *m_daSTLDistance)
{
    int ID = threadIdx.x;
    m_daSTLDistance[ID].Magnitude = m_daSTLDistance[ID].Magnitude + c_daSTLDistance[ID].Magnitude ;
    m_daSTLDistance[ID].Direction = 2 ;
}
// main routine that executes on the host
int main(void)
{
    CDistance *m_haSTLDistance,*m_daSTLDistance;
    m_haSTLDistance = new CDistance[10];
    for(int i=0;i<10;i++)
    {
        m_haSTLDistance[i].Magnitude=3;
        m_haSTLDistance[i].Direction=2;
    }
    //m_haSTLDistance =(CDistance*)malloc(100 * sizeof(CDistance));
    cudaMalloc((void**)&m_daSTLDistance,sizeof(CDistance)*10);
    cudaMemcpy(m_daSTLDistance, m_haSTLDistance,sizeof(CDistance)*10, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(c_daSTLDistance, m_haSTLDistance, sizeof(m_daSTLDistance)*10);
    CalcSTLDistance_Kernel<<< 1, 100 >>> (m_daSTLDistance);
    cudaMemcpy(m_haSTLDistance, m_daSTLDistance, sizeof(CDistance)*10, cudaMemcpyDeviceToHost);
    for (int i=0;i<10;i++){
        cout<<m_haSTLDistance[i].Magnitude<<endl;
    }
    free(m_haSTLDistance);
    cudaFree(m_daSTLDistance);
}
In the output, the constant c_daSTLDistance[ID].Magnitude is not accessed in the kernel and the statically assigned value 3 is obtained, whereas I want the device value 3 to be added to the constant value so that a total of 6 is returned.
Running cuda-memcheck reports an out-of-bounds memory read.
Your code doesn't work because of an uninitialised pointer/buffer overflow problem around the use of c_daSTLDistance. It is illegal to do this:
__constant__ CDistance *c_daSTLDistance;
....
cudaMemcpyToSymbol(c_daSTLDistance, m_haSTLDistance, sizeof(m_daSTLDistance)*10);
No memory was ever allocated for c_daSTLDistance, nor was it set to a valid value.
Further, note that all constant memory variables must be statically defined, and there is no ability to dynamically allocate constant memory at runtime. Therefore, what you are attempting to do can't be made to work as written; constant memory needs a fixed-size declaration such as __constant__ CDistance c_daSTLDistance[10]; rather than a pointer. Also note that on all but the very oldest of CUDA devices, kernel arguments are stored in constant memory. So if you had a trivially small array of constant structures, it would be far easier and simpler to pass them by value to the kernel. The compiler and runtime will automagically place them in constant memory for you without any explicit host API calls.

Making CUB blockradixsort on-chip entirely?

I am reading the CUB documentation and examples:
#include <cub/cub.cuh> // or equivalently <cub/block/block_radix_sort.cuh>
__global__ void ExampleKernel(...)
{
    // Specialize BlockRadixSort for 128 threads owning 4 integer items each
    typedef cub::BlockRadixSort<int, 128, 4> BlockRadixSort;
    // Allocate shared memory for BlockRadixSort
    __shared__ typename BlockRadixSort::TempStorage temp_storage;
    // Obtain a segment of consecutive items that are blocked across threads
    int thread_keys[4];
    ...
    // Collectively sort the keys
    BlockRadixSort(temp_storage).Sort(thread_keys);
    ...
}
In the example, each thread has 4 keys. It looks like 'thread_keys' will be allocated in local memory (i.e. off-chip). If I only have 1 key per thread, could I declare "int thread_key;" and keep this variable in a register only?
BlockRadixSort(temp_storage).Sort() takes a pointer to the keys as its parameter. Does that mean the keys have to be in global memory?
I would like to use this code, but I want each thread to hold one key in a register and keep it on-chip, in registers or shared memory, after the keys are sorted.
Thanks in advance!
You can do this using shared memory (which will keep it "on-chip"). I'm not sure I know how to do it using strictly registers without de-constructing the BlockRadixSort object.
Here's an example code that uses shared memory to hold the initial data to be sorted, and the final sorted results. This sample is mostly set up for one data element per thread, since that seems to be what you are asking for. It's not difficult to extend it to multiple elements per thread, and I have put most of the plumbing in place to do that, with the exception of the data synthesis and debug printouts:
#include <cub/cub.cuh>
#include <stdio.h>
#define nTPB 32
#define ELEMS_PER_THREAD 1
// Block-sorting CUDA kernel (nTPB threads each owning ELEMS_PER_THREAD integers)
__global__ void BlockSortKernel()
{
    __shared__ int my_val[nTPB*ELEMS_PER_THREAD];
    using namespace cub;
    // Specialize BlockRadixSort collective types
    typedef BlockRadixSort<int, nTPB, ELEMS_PER_THREAD> my_block_sort;
    // Allocate shared memory for collectives
    __shared__ typename my_block_sort::TempStorage sort_temp_stg;
    // need to extend synthetic data for ELEMS_PER_THREAD > 1
    my_val[threadIdx.x*ELEMS_PER_THREAD] = (threadIdx.x + 5)%nTPB; // synth data
    __syncthreads();
    printf("thread %d data = %d\n", threadIdx.x, my_val[threadIdx.x*ELEMS_PER_THREAD]);
    // Collectively sort the keys
    my_block_sort(sort_temp_stg).Sort(*static_cast<int(*)[ELEMS_PER_THREAD]>(static_cast<void*>(my_val+(threadIdx.x*ELEMS_PER_THREAD))));
    __syncthreads();
    printf("thread %d sorted data = %d\n", threadIdx.x, my_val[threadIdx.x*ELEMS_PER_THREAD]);
}
int main(){
    BlockSortKernel<<<1,nTPB>>>();
    cudaDeviceSynchronize();
}
This seems to work correctly for me, in this case I happened to be using RHEL 5.5/gcc 4.1.2, CUDA 6.0 RC, and CUB v1.2.0 (which is quite recent).
The strange/ugly static casting is needed, as far as I can tell, because the CUB Sort expects a reference to an array whose length equals the customization parameter ITEMS_PER_THREAD (i.e. ELEMS_PER_THREAD):
__device__ __forceinline__ void Sort(
Key (&keys)[ITEMS_PER_THREAD],
int begin_bit = 0,
int end_bit = sizeof(Key) * 8)
{ ...

CUDA: Max of array, how to prevent write collisions?

I have an array of doubles stored in GPU global memory and I need to find the maximum value in it. I have read some texts about parallel reduction, so I know that one should divide the array between blocks, have each block find its own maximum, and so on.
But they never seem to address the issue of threads trying to write to the same memory position simultaneously.
Let's say that local_max=0.0 at the beginning of a block's execution. Then each thread reads its value from the input vector, decides that it is larger than local_max, and then tries to write its value to local_max. When all of this happens at the exact same time (at least inside the same warp), how can this work and end up with the actual maximum within this block?
I would think either an atomic function or some kind of lock or critical section would be needed, but I haven't seen this addressed in the answers I have found. (e.g. http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/reduction/doc/reduction.pdf )
The answer to your question is contained in the very document you linked to, and the SDK reduction example shows concrete implementations of the reduction concept.
For completeness, here is a concrete example of a reduction kernel:
template <typename T, int BLOCKSIZE>
__global__ void reduction(T *inputvals, T *outputvals, int N)
{
    __shared__ volatile T data[BLOCKSIZE];
    // Each thread first accumulates a private maximum over a strided slice of the input
    T maxval = inputvals[threadIdx.x];
    for(int i=blockDim.x + threadIdx.x; i<N; i+=blockDim.x)
    {
        maxfunc(maxval, inputvals[i]);
    }
    data[threadIdx.x] = maxval;
    __syncthreads();
    // Here maxfunc(a,b) sets a to the maximum of a and b
    if (threadIdx.x < 32) {
        // Fold the rest of the shared array into the first 32 slots
        for(int i=32+threadIdx.x; i < BLOCKSIZE; i+= 32) {
            maxfunc(data[threadIdx.x], data[i]);
        }
        // Each step halves the active range; every slot has exactly one writer
        if (threadIdx.x < 16) maxfunc(data[threadIdx.x], data[threadIdx.x+16]);
        if (threadIdx.x < 8) maxfunc(data[threadIdx.x], data[threadIdx.x+8]);
        if (threadIdx.x < 4) maxfunc(data[threadIdx.x], data[threadIdx.x+4]);
        if (threadIdx.x < 2) maxfunc(data[threadIdx.x], data[threadIdx.x+2]);
        if (threadIdx.x == 0) {
            maxfunc(data[0], data[1]);
            outputvals[blockIdx.x] = data[0];
        }
    }
}
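The key point is using the synchronization that is implicit within a warp to perform the reduction in shared memory. The result is a single per-block maximum value. A second reduction pass is required to reduce the set of block maximums to the global maximum (often it is faster to do this on the host). Note that this warp-synchronous style relies on the threads of a warp executing in lockstep; on GPUs with independent thread scheduling (Volta and newer) the steps inside the warp need explicit __syncwarp() calls or warp shuffle intrinsics to be safe. In this example, maxfunc is the "compare and set" function, which must be declared ahead of the kernel and could be as simple as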
template <typename T>
__device__ void maxfunc(T & a, T & b)
{
    a = (b > a) ? b : a;
}
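To make it concrete why the threads in such a reduction never write to the same location, the pattern can be modeled sequentially in C++. This is only a host-side sketch under the assumption of a power-of-two array size; the function name tree_max is invented, and the inner loop stands in for the threads that would run concurrently in one step on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential model of an in-block tree reduction for the maximum.
// `data` plays the role of the shared-memory array (size must be a power of two).
double tree_max(std::vector<double> data) {
    assert(!data.empty() && (data.size() & (data.size() - 1)) == 0);
    for (std::size_t stride = data.size() / 2; stride > 0; stride /= 2) {
        // On a GPU, the iterations of this inner loop are the threads of one step:
        // "thread" t reads slots t and t + stride and writes only slot t.
        for (std::size_t t = 0; t < stride; ++t)
            if (data[t + stride] > data[t]) data[t] = data[t + stride];
        // A barrier (__syncthreads() in CUDA) would separate consecutive steps here.
    }
    return data[0];
}
```

Within one step the written slots [0, stride) and the extra slots read [stride, 2*stride) do not overlap, and each written slot has exactly one writer, so no atomics or locks are needed; only the barrier between steps matters.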
Don't cook your own code, use Thrust (included in version 4.0 of the CUDA SDK):
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/extrema.h>
#include <thrust/sequence.h>
#include <thrust/copy.h>
#include <iostream>
#include <iterator>
int main(void)
{
    thrust::host_vector<int> h_vec(10000);
    thrust::sequence(h_vec.begin(), h_vec.end());
    // show hvec
    thrust::copy(h_vec.begin(), h_vec.end(),
                 std::ostream_iterator<int>(std::cout, "\n"));
    // transfer to device
    thrust::device_vector<int> d_vec = h_vec;
    // max_element returns an iterator; dereference it to get the value
    int max_dvec_value = *thrust::max_element(d_vec.begin(), d_vec.end());
    std::cout << "max value: " << max_dvec_value << "\n";
    return 0;
}
And watch out: thrust::max_element returns an iterator, not the value itself, so it has to be dereferenced as shown.
Your question is clearly answered in the document you link to. I think you just need to spend some more time reading it and understanding the CUDA concepts used in it. In particular, I would focus on shared memory, the __syncthreads() method, and how to uniquely identify a thread while inside a kernel. Additionally, you should try to understand why the reduction may need to be run in 2 passes to find the global maximum.