I've tested the following on a GTX 690 GPU with 4 GB RAM, on Windows 7 x64 with Visual C++ 10:
I've written a function that receives two vectors and adds them into a third vector. The task is split across the 2 GPU devices. I gradually increased the vector size to benchmark GPU performance. The required time increases linearly with vector size up to a certain point, and then it abruptly jumps up. When I disable either of the two GPUs, the required time stays linear up to the end of available memory. I've enclosed a diagram showing required time versus allocated memory.
You can see the speed diagram here: [Speed Comparison Diagram]
Can you tell me what is wrong?
Best regards,
Ramin
This is my code:
#include <ctime>   // for clock()
#include <npp.h>   // for the Npp32u typedef used by AddKernel

__global__ void AddKernel( const Npp32u * __restrict__ pSource1 ,
                           const Npp32u * __restrict__ pSource2 ,
                           Npp32u * __restrict__ pDestination ,
                           unsigned uLength ) ;

unsigned BenchMark( unsigned VectorSize )
{
    unsigned * D[ 2 ][ 3 ] ;

    // Allocate three vectors (two sources, one destination) on each device.
    for ( int i = 0 ; i < 2 ; i++ )
    {
        cudaSetDevice( i ) ;
        for ( int j = 0 ; j < 3 ; j++ )
            cudaMalloc( & D[ i ][ j ] , VectorSize * sizeof( unsigned ) ) ;
    }

    unsigned uStartTime = clock() ;

    // TEST: launch the addition on both devices.
    for ( int i = 0 ; i < 2 ; i++ )
    {
        cudaSetDevice( i ) ;
        AddKernel<<< VectorSize / 256 , 256 >>>( D[ i ][ 0 ] ,
                                                 D[ i ][ 1 ] ,
                                                 D[ i ][ 2 ] ,
                                                 VectorSize ) ;
    }
    cudaDeviceSynchronize() ;   // synchronizes device 1 (the current device)
    cudaSetDevice( 0 ) ;
    cudaDeviceSynchronize() ;   // synchronizes device 0

    unsigned uEndTime = clock() ;

    for ( int i = 0 ; i < 2 ; i++ )
    {
        cudaSetDevice( i ) ;
        for ( int j = 0 ; j < 3 ; j++ )
            cudaFree( D[ i ][ j ] ) ;
    }

    return uEndTime - uStartTime ;
}

__global__ void AddKernel( const Npp32u * __restrict__ pSource1 ,
                           const Npp32u * __restrict__ pSource2 ,
                           Npp32u * __restrict__ pDestination ,
                           unsigned uLength )
{
    unsigned x = blockIdx.x * blockDim.x + threadIdx.x ;
    if ( x < uLength )
        pDestination[ x ] = pSource1[ x ] + pSource2[ x ] ;
}
I found the answer: the problem happened because SLI was active. I disabled it, and now it is working smoothly.
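For anyone hitting the same issue: once SLI is disabled, both halves of the GTX 690 should enumerate as separate CUDA devices. A minimal sketch to verify that (my addition, using only standard CUDA runtime calls):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // With SLI disabled, a GTX 690 should report two devices here.
    int count = 0 ;
    cudaGetDeviceCount( & count ) ;
    printf( "CUDA devices visible: %d\n" , count ) ;
    for ( int i = 0 ; i < count ; i++ )
    {
        cudaDeviceProp prop ;
        cudaGetDeviceProperties( & prop , i ) ;
        printf( "Device %d: %s, %llu bytes of global memory\n" ,
                i , prop.name , (unsigned long long) prop.totalGlobalMem ) ;
    }
    return 0 ;
}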
Below is a simplified version of a problem that I am trying to solve. Both code snippets compile, but #2 throws an "illegal memory access". Basically, if an array is encapsulated in a structure, passing a pointer to that structure to cudaMalloc creates all kinds of problems -- at least the way I do it. I am pretty sure this is because the address of dum in the code below is on the host, and so is not accessible inside the kernel. The problem is, I don't know how to create a device version of dum... E.g., using cudaMalloc( (void**)&dum , sizeof(dummy) * 1 ) instead of the new dummy syntax below does not solve the problem. I think I am getting confused by the double pointer used by cudaMalloc.
Of course it may seem silly in this example to encapsulate an array of double in a structure; in the actual code I really do need to do this, though.
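For reference while reading the code below, my current understanding of the double pointer (an illustration only, not my actual code): cudaMalloc takes the address of a pointer purely so it can write the allocated device address back through it.

double *d_ptr = NULL;                               // host variable that will hold a device address
cudaMalloc( (void**)&d_ptr , 10 * sizeof(double) ); // cudaMalloc writes the device address into d_ptr
// d_ptr itself lives on the host; only the memory it points to is on the device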
struct dummy
{
    double *arr;
};

void allocate( dummy *dum , int n )
{
    cudaMalloc( (double**)&(dum->arr) , sizeof(double) * n );
}

__global__ void test( double val , dummy *dum , int n )
{
    printf( "test\n" );
    for( int ii = 0 ; ii < n ; ii++ )
        dum->arr[ii] = val;
}

__global__ void test2( double val , double *arr , int n )
{
    printf( "test\n" );
    for( int ii = 0 ; ii < n ; ii++ )
        arr[ii] = val;
}

int main()
{
    int n = 10;
    dummy *dum = new dummy;

    /* CODE 1: the piece of code below works */
    double *p;
    gpu_err_chk( cudaMalloc( &p , sizeof(double) * n ) );
    test2<<< 1 , 1 >>>( 123.0 , p , n );
    gpu_err_chk( cudaDeviceSynchronize() );

    /* CODE 2: the piece of code below does not... */
    allocate( dum , n );
    test<<< 1 , 1 >>>( 123.0 , dum , n );
    gpu_err_chk( cudaDeviceSynchronize() );

    return 1;
}
After digging through some examples in previous posts by Robert, I was able to rewrite the code so that it works:
struct dummy
{
    double *arr;
};

__global__ void test( dummy *dum , int n )
{
    printf( "test\n" );
    for( int ii = 0 ; ii < n ; ii++ )
        printf( "dum->arr[%d] = %f\n" , ii , dum->arr[ii] );
}

int main()
{
    int n = 10;
    dummy *dum_d , *dum_h;

    srand( time(0) );

    dum_h = new dummy;
    dum_h->arr = new double[n];
    for( int ii = 0 ; ii < n ; ii++ ){
        dum_h->arr[ii] = double( rand() ) / RAND_MAX;
        printf( "reference data %d = %f\n" , ii , dum_h->arr[ii] );
    }

    cudaMalloc( &dum_d , sizeof(dummy) * 1 );
    cudaMemcpy( dum_d , dum_h , sizeof(dummy) * 1 , cudaMemcpyHostToDevice );

    double *tmp;
    cudaMalloc( &tmp , sizeof(double) * n );
    cudaMemcpy( &( dum_d->arr ) , &tmp , sizeof(double*) , cudaMemcpyHostToDevice ); // copy the device pointer into the device structure's arr member
    cudaMemcpy( tmp , dum_h->arr , sizeof(double) * n , cudaMemcpyHostToDevice );

    delete [] dum_h->arr;
    delete dum_h;

    test<<< 1 , 1 >>>( dum_d , n );
    gpu_err_chk( cudaDeviceSynchronize() );

    cudaFree( tmp );
    cudaFree( dum_d );

    return 1;
}
However, I am still confused about why this works. Does anyone have a visual diagram of what's going on? I am getting lost among the different pointers...
Moreover, there is one thing that really blows my mind: I can free tmp right before the kernel launch and the code still works, i.e.:
cudaFree( tmp );
test<<< 1 , 1 >>>( dum_d , n );
gpu_err_chk( cudaDeviceSynchronize() );
How is this the case? In my mind (clearly wrong), the device array containing the random values is gone...
Another point of confusion is that I can't free dum_d->arr directly (cudaFree(dum_d->arr)); this throws a segmentation fault.
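That segfault is expected, since dum_d is a device address and dereferencing it on the host is invalid. One way to free the nested allocation without that dereference is a sketch like the following (dum_tmp is a name introduced just for this illustration; in the code above, cudaFree( tmp ) achieves the same thing, because tmp holds the same device address that was copied into dum_d->arr):

dummy dum_tmp;
cudaMemcpy( &dum_tmp , dum_d , sizeof(dummy) , cudaMemcpyDeviceToHost ); // fetch the struct contents back to the host
cudaFree( dum_tmp.arr );  // dum_tmp.arr now holds the device address stored in dum_d->arr
cudaFree( dum_d );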
#include <iostream>
using namespace std ;

#define min(x,y) (x>y?x:y)
#define N 33*1024
#define ThreadPerBlock 256
//smallest multiple of threadsPerBlock that is greater than or equal to N
#define blockPerGrid min(32 , (N+ThreadPerBlock-1) / ThreadPerBlock )

__global__ void Vector_Dot_Product ( const float *V1 , const float *V2 , float *V3 )
{
    __shared__ float chache[ThreadPerBlock] ;
    float temp ;
    const unsigned int tid = blockDim.x * blockIdx.x + threadIdx.x ;
    const unsigned int chacheindex = threadIdx.x ;
    while ( tid < N )
    {
        temp += V1[tid] * V2[tid] ;
        tid += blockDim.x * gridDim.x ;
    }
    chache[chacheindex] = temp ;
    __synchthreads () ;
    int i = blockDim.x / 2 ;
    while ( i != 0 )
    {
        if ( chacheindex < i )
            chache[chacheindex] += chache [chacheindex + i] ;
        __synchthreads () ;
        i /= 2 ;
    }
    if ( chacheindex == 0 )
        V3[blockIdx.x] = chache [0] ;
}

int main ( int argv , char *argc )
{
    float *V1_H , *V2_H , *V3_H ;
    float *V1_D , *V2_D , *V3_D ;

    V1_H = new float [N] ;
    V2_H = new float [N] ;
    V3_H = new float [blockPerGrid] ;

    cudaMalloc ( (void **)&V1_D , N*sizeof(float)) ;
    cudaMalloc ( (void **)&V2_D , N*sizeof(float)) ;
    cudaMalloc ( (void **)&V3_D , blockPerGrid*sizeof(float)) ;

    for ( int i = 0 ; i < N ; i++ )
    {
        V1_H[i] = i ;
        V2_H[i] = i*2 ;
    }

    cudaMemcpy ( V1_D , V1_H , N*sizeof(float) , cudaMemcpyHostToDevice ) ;
    cudaMemcpy ( V2_D , V2_H , N*sizeof(float) , cudaMemcpyHostToDevice ) ;

    Vector_Dot_Product <<< blockPerGrid , ThreadPerBlock >>> ( V1_D , V2_D , V3_D ) ;

    cudaMemcpy ( V3_H , V3_D , N*sizeof(float) , cudaMemcpyDeviceToHost ) ;

    cout << "\n Vector Dot Prodcut is : " ;
    float sum = 0 ;
    for ( int i = 0 ; i < blockPerGrid ; i++ )
        sum += V3_H[i] ;
    cout << sum << endl ;

    cudaFree ( V1_D) ;
    cudaFree ( V2_D) ;
    cudaFree ( V3_D) ;

    delete [] V1_H ;
    delete [] V2_H ;
    delete [] V3_H ;
}
Please tell me what the problem is in this code. I can't figure it out. Thanks in advance.
Regarding this:
identifier “__synchthreads” is undefined
Wherever you have this:
__synchthreads();
You should change it to this:
__syncthreads();
Regarding this:
expression must be a modifiable lvalue
Since you have defined tid as const here:
const unsigned int tid = blockDim.x * blockIdx.x + threadIdx.x ;
You are not allowed to try and change it here:
tid += blockDim.x * gridDim.x ;
So the simplest solution might be to just drop the const from the tid definition:
unsigned int tid = blockDim.x * blockIdx.x + threadIdx.x ;
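Putting both fixes together, the top of the kernel would read as follows (the rest is unchanged; initializing temp is my additional suggestion, since it is otherwise read before ever being written):

__global__ void Vector_Dot_Product ( const float *V1 , const float *V2 , float *V3 )
{
    __shared__ float chache[ThreadPerBlock] ;
    float temp = 0.0f ;   // suggested: temp was previously used uninitialized
    unsigned int tid = blockDim.x * blockIdx.x + threadIdx.x ;   // const dropped
    const unsigned int chacheindex = threadIdx.x ;
    while ( tid < N )
    {
        temp += V1[tid] * V2[tid] ;
        tid += blockDim.x * gridDim.x ;
    }
    chache[chacheindex] = temp ;
    __syncthreads () ;   // corrected spelling
    ...
}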
I am trying to replace some Thrust calls with ArrayFire calls to check the performance.
I am not sure whether I am using ArrayFire properly, because the results I am getting do not match at all.
So, the Thrust code I am using is, for example:
cudaMalloc( (void**) &devRow, N * sizeof(float) );
...//devRow is filled
thrust::device_ptr<float> SlBegin( devRow );
for ( int i = 0; i < N; i++, SlBegin += PerSlElmts )
{
    thrust::inclusive_scan( SlBegin, SlBegin + PerSlElmts, SlBegin );
}
cudaMemcpy( theRow, devRow, N * sizeof(float), cudaMemcpyDeviceToHost );
//use theRow...
ArrayFire:
af::array SlBegin( N , devRow );
for ( int i = 0; i < N; i++, SlBegin += PerSlElmts )
{
    accum( SlBegin );
}
cudaMemcpy( theRow, devRow, N * sizeof(float), cudaMemcpyDeviceToHost );
//use theRow..
I am not sure how ArrayFire handles the copy in af::array SlBegin( N , devRow );. In Thrust we have a device pointer that points from devRow to SlBegin, but what happens in ArrayFire?
Also, I wanted to ask about using gfor.
The ArrayFire web page states:
Do not use this function directly; see GFOR: Parallel For-Loops.
And then, for GFOR:
GFOR is disabled in the current version of ArrayFire
So, we can't use gfor?
---------UPDATE---------------------------
I have a small running example which shows the different results:
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include "arrayfire.h"
#include <thrust/scan.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>

__global__ void Kernel( const int N , float * const devRow )
{
    int i = threadIdx.x;
    if ( i < N )
        devRow[ i ] = i;
}

int main(){
    int N = 6;
    int Slices = 2;
    int PerSlElmts = 3;

    float * theRow = (float*) malloc ( N * sizeof( float ));
    for ( int i = 0; i < N; i++ )
        theRow[ i ] = 0;

    // raw pointer to device memory
    float * devRow;
    cudaMalloc( (void **) &devRow, N * sizeof( float ) );

    Kernel<<< 1 , N >>>( N , devRow );
    cudaDeviceSynchronize();

    // wrap raw pointer with a device_ptr
    thrust::device_ptr<float> SlBegin( devRow );
    for ( int i = 0; i < Slices; i++ , SlBegin += PerSlElmts )
        thrust::inclusive_scan( SlBegin, SlBegin + PerSlElmts , SlBegin );
    cudaMemcpy( theRow, devRow, N * sizeof(float), cudaMemcpyDeviceToHost );
    for ( int i = 0; i < N; i++ )
        printf("\n Thrust accum : %f", theRow[ i ] );

    //--------------------------------------------------------------------//

    Kernel<<< 1 , N >>>( N , devRow );
    cudaDeviceSynchronize();

    af::array SlBeginFire( N, devRow );
    for ( int i = 0; i < Slices; i++ , SlBeginFire += PerSlElmts )
        af::accum( SlBeginFire );
    SlBeginFire.host( theRow );
    for ( int i = 0; i < N; i++ )
        printf("\n Arrayfire accum : %f", theRow[ i ] );

    cudaFree( devRow );
    free( theRow );
    return 0;
}
It looks like you are trying to run a column-wise (0th-dim in ArrayFire) scan on a 2D array. Here is some code that you could use:
af::array SlBegin(N, devRow);
af::array result = accum(SlBegin, 0);
Here is a sample output:
A [5 3 1 1]
0.7402 0.4464 0.7762
0.9210 0.6673 0.2948
0.0390 0.1099 0.7140
0.9690 0.4702 0.3585
0.9251 0.5132 0.6814
accum(A, 0) [5 3 1 1]
0.7402 0.4464 0.7762
1.6612 1.1137 1.0709
1.7002 1.2236 1.7850
2.6692 1.6938 2.1435
3.5943 2.2070 2.8249
This runs an inclusive scan on each column independently.
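If the scanned values are needed back on the host, the same host() member used in the update above works here too:

result.host( theRow );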
As for gfor, it has been added to the open source version of ArrayFire. As this code base is still in beta, improvements and fixes are happening very rapidly. So keep an eye on our GitHub page.
BEFORE reading below!
As I understand it, when you call cuBLAS from a kernel:
- cuBLAS calls are kernels themselves;
- the threads and blocks are managed by the cuBLAS calls;
- a cuBLAS call is launched by 1 thread (and 1 block), and it then checks the number of elements and schedules threads/blocks automatically, so you don't specify the number of threads/blocks when you run a cuBLAS call.
I am launching a kernel with 1 thread and 1 block, as I said above.
__global__ void (...)
{
    ...
    cublasCtrsm( CublasHandle , CUBLAS_SIDE_LEFT , CUBLAS_FILL_MODE_LOWER , CUBLAS_OP_N ,
                 CUBLAS_DIAG_NON_UNIT , M , N , &alpha , inCov , M , inSample , M ) ;
    for ( int i = 0; i < N; i++ )
        cublasCdotc( CublasHandle , M , inCoil + i * M , 1 , inSample + i * M , 1 , devImage + i );
}
Now, this code works fine (I am getting an image), but the for loop takes too much time. I want to optimize this for loop.
So, I tried:
int i = threadIdx.x + blockDim.x * blockIdx.x;
if ( i < N )
    cublasCdotc( CublasHandle , M , inCoil + i * M , 1 , inSample + i * M , 1 , devImage + i );
But, as I said, I am calling the kernel with 1 thread and 1 block.
So it is going to be executed by 1 thread only, right?
(That's why I am not getting the image I want, but only 1 pixel.)
As a consequence, the expression i * M is not evaluated for all N.
My question is: how can I accomplish what I want?
For anyone who followed the question, or wants to find out anyway... I came up with this solution.
In a __global__ function:

int i = threadIdx.x + blockIdx.x * blockDim.x;

devImage[ i ] = 0;

if ( i < N )
{
    for ( int j = 0; j < M; j++ )
    {
        devImage[ i ] += inCoil[ i * M + j ] * inSample[ i * M + j ] - inCoil[ i * M + j ]
                       * inSample[ i * M + j ] + inCoil[ i * M + j ] * inSample[ i * M + j ]
                       + inCoil[ i * M + j ] * inSample[ i * M + j ];
    }
}
I used a small loop ( j < M ) instead of a big one, since M is much smaller than N.
Now, I can't think of a way to use cublasCdotc and have it run fast.
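A possible host-side alternative, offered as a sketch only: the N independent conjugated dot products can be phrased as N batched 1 x 1 GEMMs with cublasCgemmStridedBatched, which exists from CUDA 8.0 onward. The array names below are the ones from the question; everything else is my assumption and is untested here.

// Each batch computes C_i = A_i^H * B_i, where A_i and B_i are
// the i-th length-M columns of inCoil and inSample.
cuComplex alpha = make_cuComplex( 1.0f , 0.0f );
cuComplex beta  = make_cuComplex( 0.0f , 0.0f );
cublasCgemmStridedBatched( CublasHandle ,
                           CUBLAS_OP_C , CUBLAS_OP_N ,
                           1 , 1 , M ,          // each product is (1 x M) * (M x 1)
                           &alpha ,
                           inCoil , M , M ,     // A_i = column i of inCoil (conjugated)
                           inSample , M , M ,   // B_i = column i of inSample
                           &beta ,
                           devImage , 1 , 1 ,   // C_i = pixel i
                           N );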
I'm trying to perform an inclusive scan to find the cumulative sum of an array. Following the advice given by harrism here, I'm using the procedure given here, but following those authors' advice I'm trying to write code that has each thread calculate 4 elements instead of one, to mask memory latency.
I am staying away from Thrust, as performance is essential and I need multi-stream capability. I have only just discovered CUB, and that will be my next effort, but I would like a multi-block solution and would also like to know where I've gone wrong in my existing code, as an exercise to better understand CUDA.
The code below allocates 4 data elements to each thread, where each block must have a multiple of 32 threads. My data will have a multiple of 128 threads, so this restriction is acceptable to me. Enough shared memory is allocated to each block for the 4*blockDim.x elements, plus an additional 32 elements for summing between warps. scanBlockAnyLength4 then adds the necessary offset to correct the mismatch between warps, saving the final value of each warp to dev_blockSum in device global memory. sumWarp4_32 then scans this array to find the per-block offsets needed to correct the mismatch between blocks, which are then added on in kernel_offsetBlocks.
#include <cuda.h>
#include <iostream>
using std::cout;
using std::endl;

#define MAX_THREADS 1024
#define MAX_BLOCKS 65536
#define N 512

__device__ float sumWarp4_128(float* ptr, const int tidx = threadIdx.x) {
    const unsigned int lane = tidx & 31;
    const unsigned int warpid = tidx >> 5; //32 threads per warp
    unsigned int i = warpid*128+lane; //first element of block data set this thread looks at
    if( lane >= 1 ) ptr[i] += ptr[i-1];
    if( lane >= 2 ) ptr[i] += ptr[i-2];
    if( lane >= 4 ) ptr[i] += ptr[i-4];
    if( lane >= 8 ) ptr[i] += ptr[i-8];
    if( lane >= 16 ) ptr[i] += ptr[i-16];
    if( lane==0 ) ptr[i+32] += ptr[i+31];
    if( lane >= 1 ) ptr[i+32] += ptr[i+32-1];
    if( lane >= 2 ) ptr[i+32] += ptr[i+32-2];
    if( lane >= 4 ) ptr[i+32] += ptr[i+32-4];
    if( lane >= 8 ) ptr[i+32] += ptr[i+32-8];
    if( lane >= 16 ) ptr[i+32] += ptr[i+32-16];
    if( lane==0 ) ptr[i+64] += ptr[i+63];
    if( lane >= 1 ) ptr[i+64] += ptr[i+64-1];
    if( lane >= 2 ) ptr[i+64] += ptr[i+64-2];
    if( lane >= 4 ) ptr[i+64] += ptr[i+64-4];
    if( lane >= 8 ) ptr[i+64] += ptr[i+64-8];
    if( lane >= 16 ) ptr[i+64] += ptr[i+64-16];
    if( lane==0 ) ptr[i+96] += ptr[i+95];
    if( lane >= 1 ) ptr[i+96] += ptr[i+96-1];
    if( lane >= 2 ) ptr[i+96] += ptr[i+96-2];
    if( lane >= 4 ) ptr[i+96] += ptr[i+96-4];
    if( lane >= 8 ) ptr[i+96] += ptr[i+96-8];
    if( lane >= 16 ) ptr[i+96] += ptr[i+96-16];
    return ptr[i+96];
}

__host__ __device__ float sumWarp4_32(float* ptr, const int tidx = threadIdx.x) {
    const unsigned int lane = tidx & 31;
    const unsigned int warpid = tidx >> 5; //32 elements per warp
    unsigned int i = warpid*32+lane; //first element of block data set this thread looks at
    if( lane >= 1 ) ptr[i] += ptr[i-1];
    if( lane >= 2 ) ptr[i] += ptr[i-2];
    if( lane >= 4 ) ptr[i] += ptr[i-4];
    if( lane >= 8 ) ptr[i] += ptr[i-8];
    if( lane >= 16 ) ptr[i] += ptr[i-16];
    return ptr[i];
}

__device__ float sumBlock4(float* ptr, const int tidx = threadIdx.x, const int bdimx = blockDim.x ) {
    const unsigned int lane = tidx & 31;
    const unsigned int warpid = tidx >> 5; //32 threads per warp
    float val = sumWarp4_128(ptr);
    __syncthreads();//should be included
    if( tidx==bdimx-1 ) ptr[4*bdimx+warpid] = val;
    __syncthreads();
    if( warpid==0 ) sumWarp4_32((float*)&ptr[4*bdimx]);
    __syncthreads();
    if( warpid>0 ) {
        ptr[warpid*128+lane] += ptr[4*bdimx+warpid-1];
        ptr[warpid*128+lane+32] += ptr[4*bdimx+warpid-1];
        ptr[warpid*128+lane+64] += ptr[4*bdimx+warpid-1];
        ptr[warpid*128+lane+96] += ptr[4*bdimx+warpid-1];
    }
    __syncthreads();
    return ptr[warpid*128+lane+96];
}

__device__ void scanBlockAnyLength4(float *ptr, float* dev_blockSum, const float* dev_input, float* dev_output, const int idx = threadIdx.x, const int bdimx = blockDim.x, const int bidx = blockIdx.x) {
    const unsigned int lane = idx & 31;
    const unsigned int warpid = idx >> 5;
    ptr[lane+warpid*128] = dev_input[lane+warpid*128+bdimx*bidx*4];
    ptr[lane+warpid*128+32] = dev_input[lane+warpid*128+bdimx*bidx*4+32];
    ptr[lane+warpid*128+64] = dev_input[lane+warpid*128+bdimx*bidx*4+64];
    ptr[lane+warpid*128+96] = dev_input[lane+warpid*128+bdimx*bidx*4+96];
    __syncthreads();
    float val = sumBlock4(ptr);
    __syncthreads();
    dev_blockSum[0] = 0.0f;
    if( idx==0 ) dev_blockSum[bidx+1] = ptr[bdimx*4-1];
    dev_output[lane+warpid*128+bdimx*bidx*4] = ptr[lane+warpid*128];
    dev_output[lane+warpid*128+bdimx*bidx*4+32] = ptr[lane+warpid*128+32];
    dev_output[lane+warpid*128+bdimx*bidx*4+64] = ptr[lane+warpid*128+64];
    dev_output[lane+warpid*128+bdimx*bidx*4+96] = ptr[lane+warpid*128+96];
    __syncthreads();
}

__global__ void kernel_sumBlock(float* dev_blockSum, const float* dev_input, float* dev_output ) {
    extern __shared__ float ptr[];
    scanBlockAnyLength4(ptr,dev_blockSum,dev_input,dev_output);
}

__global__ void kernel_offsetBlocks(float* dev_blockSum, float* dev_arr) {
    const int tidx = threadIdx.x;
    const int bidx = blockIdx.x;
    const int bdimx = blockDim.x;
    const int lane = tidx & 31;
    const int warpid = tidx >> 5;
    if( warpid==0 ) sumWarp4_32(dev_blockSum);
    float val = dev_blockSum[warpid];
    dev_arr[warpid*128+lane] += val;
    dev_arr[warpid*128+lane+32] += val;
    dev_arr[warpid*128+lane+64] += val;
    dev_arr[warpid*128+lane+96] += val;
}

void scan4( const float input[], float output[]) {
    int blocks = 2;
    int threadsPerBlock = 64; //multiple of 32
    int smemsize = (threadsPerBlock*4+32)*sizeof(float);
    float* dev_input, *dev_output;
    cudaMalloc((void**)&dev_input,blocks*threadsPerBlock*4*sizeof(float));
    cudaMalloc((void**)&dev_output,blocks*threadsPerBlock*4*sizeof(float));
    float *dev_blockSum;
    cudaMalloc((void**)&dev_blockSum,blocks*sizeof(float));
    int offset = 0;
    int Nrem = N;
    int chunksize;
    while( Nrem ) {
        chunksize = max(Nrem,blocks*threadsPerBlock*4);
        cudaMemcpy(dev_input,(void**)&input[offset],chunksize*sizeof(float),cudaMemcpyHostToDevice);
        kernel_sumBlock<<<blocks,threadsPerBlock,smemsize>>>(dev_blockSum,dev_input,dev_output);
        kernel_offsetBlocks<<<blocks,threadsPerBlock>>>(dev_blockSum,dev_output);
        cudaMemcpy((void**)&output[offset],dev_output,chunksize*sizeof(float),cudaMemcpyDeviceToHost);
        offset += chunksize;
        Nrem -= chunksize;
    }
    cudaFree(dev_input);
    cudaFree(dev_output);
}

int main() {
    float h_vec[N], sol[N];
    for( int i = 0; i < N; i++ ) h_vec[i] = (float)i+1.0f;
    scan4(h_vec,sol);
    cout << "solution:" << endl;
    for( int i = 0; i < N; i++ ) cout << i << " " << (i+2)*(i+1)/2 << " " << sol[i] << endl;
    return 0;
}
To my eye, the code is throwing errors because the lines in sumWarp4_128 are not executed in order within a warp; i.e., the if( lane==0 ) lines are executing before the other logical blocks that precede them. I thought this was not possible within a warp.
If I __syncthreads() before and after the lane==0 calls, I get some new exotic error that I just can't figure out.
Any help pointing out where I've gone wrong would be appreciated.
The code you are writing has race conditions due to not synchronizing between threads that are sharing data. While it is true that this can be done on current hardware for communication within a warp (so-called warp-synchronous programming), it is highly discouraged because the race conditions in the code could cause it to fail on possible future hardware.
While it is true that you will get higher performance by processing multiple items per thread, 4 is not a magic number; you should make this a tunable parameter if possible. CUDPP uses 8 per thread, for example.
I would highly recommend that you use CUB for this. You should use cub::BlockLoad() to load multiple items per thread and cub::BlockScan() to scan them. Then you would just need some code to combine multiple blocks. The most bandwidth-efficient way to do this is to use the "Reduce-Scan-Scan" approach that Thrust uses. First reduce each block (cub::BlockReduce) and store the sum from each block to a blockSums array. Then scan that array to get the per-block offset. Then perform a cub::BlockScan on the blocks and add the previously computed per-block offset to each element.
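To make the block-local part of that recipe concrete, here is a minimal sketch (my illustration, not canonical CUB documentation code; BLOCK_THREADS and ITEMS_PER_THREAD are the tunable parameters discussed above, and the grid is assumed to evenly divide the input):

#include <cub/cub.cuh>

template <int BLOCK_THREADS, int ITEMS_PER_THREAD>
__global__ void BlockScanKernel( const float* d_in , float* d_out )
{
    // Specialize the CUB primitives for this block size and items per thread.
    typedef cub::BlockLoad<float, BLOCK_THREADS, ITEMS_PER_THREAD>  BlockLoadT;
    typedef cub::BlockStore<float, BLOCK_THREADS, ITEMS_PER_THREAD> BlockStoreT;
    typedef cub::BlockScan<float, BLOCK_THREADS>                    BlockScanT;

    // The primitives' shared memory can be overlapped in a union, since each
    // stage finishes (followed by a __syncthreads) before the next begins.
    __shared__ union {
        typename BlockLoadT::TempStorage  load;
        typename BlockScanT::TempStorage  scan;
        typename BlockStoreT::TempStorage store;
    } temp_storage;

    const int block_offset = blockIdx.x * BLOCK_THREADS * ITEMS_PER_THREAD;
    float items[ITEMS_PER_THREAD];

    BlockLoadT(temp_storage.load).Load(d_in + block_offset, items);   // ITEMS_PER_THREAD items per thread
    __syncthreads();

    BlockScanT(temp_storage.scan).InclusiveSum(items, items);         // block-wide inclusive prefix sum
    __syncthreads();

    BlockStoreT(temp_storage.store).Store(d_out + block_offset, items);
}

// Example launch: scans each tile of BLOCK_THREADS*ITEMS_PER_THREAD elements
// independently; combining tiles still needs the block-sums pass described above.
// BlockScanKernel<128, 4><<<num_tiles, 128>>>(d_in, d_out);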