Construct binary tree recursively in CUDA

I want to build a binary tree in a vector such that each parent's value is the sum of its two children. Building the tree recursively in C would look like this:
int construct(int elements[], int start, int end, int* tree, int index) {
if (start == end) {
tree[index] = elements[start];
return tree[index];
}
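int middle = start + (end - start) / 2;
// [start, end] is inclusive, so the right half must start at middle + 1;
// otherwise the recursion never terminates when end == start + 1
tree[index] = construct(elements, start, middle, tree, index*2) +
construct(elements, middle + 1, end, tree, index*2+1);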
return tree[index];
}
But I don't know how to build it in CUDA in parallel using threads. One reference I found useful says:
How should we go about parallelizing this kind of recursive algorithm? One way is to use the approach presented by Garanzha et al., which processes the levels of nodes sequentially, starting from the root. The idea is to maintain a growing array of nodes in a breadth-first order, so that every level in the hierarchy corresponds to a linear range of nodes. On a given level, we launch one thread for each node that falls into this range. The thread starts by reading first and last from the node array and calling findSplit(). It then appends the resulting child nodes to the same node array using an atomic counter and writes out their corresponding sub-ranges. This process iterates so that each level outputs the nodes contained on the next level, which then get processed in the next round.
That is, it processes the levels sequentially and parallelizes over the nodes within each level. I think it makes total sense, but I don't know how to implement it exactly. Can somebody give me an idea or an example of how to do that?

I am not sure the indexing scheme described above would work.
Here is some sample code that could work (though the tree indexing might not suit your needs):
__global__ void buildtreelevel(const int* elements, int count, int* tree)
{
int parentcount = (count + 1) >> 1;
for (int k = threadIdx.x + blockDim.x * blockIdx.x ; k < parentcount ; k += blockDim.x * gridDim.x)
{
if ((2*k+1) < count)
tree[k] = elements[k*2] + elements[k*2+1] ;
else
tree[k] = elements[k*2] ;
}
}
This kernel only processes one tree level at a time. The overall tree size is given by:
int treesize (int count, int& maxlevel)
{
int res = 1 ;
while (count > 1)
{
count = (count + 1) >> 1 ;
res += count ;
++maxlevel;
}
return res ;
}
And building the whole tree requires several calls to the buildtreelevel kernel:
int buildtree (int grid, int block, const int* d_elements, int count, int** h_tree, int* d_data)
{
const int* ptr_elements = d_elements ;
int* ptr_data = d_data ;
int level = 0 ;
int levelcount = count ;
while (levelcount > 1)
{
buildtreelevel <<< grid, block >>> (ptr_elements, levelcount, ptr_data) ;
levelcount = (levelcount + 1) >> 1 ;
h_tree [level++] = ptr_data ;
ptr_elements = ptr_data ;
ptr_data += levelcount ;
}
return level ;
}
Synchronization only needs to occur at the end, because all kernels are launched on stream 0 and kernels in the same stream execute in launch order.
int main()
{
int nElements = 10000000 ;
int* d_elements ;
int* d_data ;
int** h_tree ;
int maxlevel = 1 ;
cudaMalloc ((void**)&d_elements, nElements * sizeof (int)) ;
cudaMalloc ((void**)&d_data, treesize(nElements, maxlevel) * sizeof (int)) ;
h_tree = new int*[maxlevel];
buildtree (64, 256, d_elements, nElements, h_tree, d_data) ;
cudaError_t res = cudaDeviceSynchronize() ;
if (cudaSuccess != res)
fprintf (stderr, "ERROR (%d) : %s \n", res, cudaGetErrorString(res));
cudaDeviceReset();
}
Your tree structure is stored in h_tree, which is a host array of device pointers (one pointer per level).
This is not optimal, but might be a good start. Using aligned int4 loads with __ldg and processing 4 levels at a time might improve performance.
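As a way to try the level-by-level idea end to end, here is a minimal, self-contained sketch. It is not the code above verbatim: the names build_level and tree_root are made up for illustration, the leaves and all parent levels are assumed to live in one device buffer, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One tree level: each parent is the sum of its (up to two) children.
__global__ void build_level(const int* children, int count, int* parents)
{
    int parent_count = (count + 1) / 2;
    for (int k = blockIdx.x * blockDim.x + threadIdx.x; k < parent_count;
         k += blockDim.x * gridDim.x)
    {
        int left  = children[2 * k];
        // The last parent of an odd-sized level has only one child.
        int right = (2 * k + 1 < count) ? children[2 * k + 1] : 0;
        parents[k] = left + right;
    }
}

// Hypothetical helper: builds every level on the device and returns the root.
int tree_root(const std::vector<int>& leaves)
{
    int count = static_cast<int>(leaves.size());
    if (count == 0) return 0;

    // Total storage: the leaf level plus every parent level up to the root.
    int total = 0;
    for (int c = count; ; c = (c + 1) / 2) { total += c; if (c == 1) break; }

    int* d_nodes = nullptr;
    cudaMalloc(&d_nodes, total * sizeof(int));
    cudaMemcpy(d_nodes, leaves.data(), count * sizeof(int), cudaMemcpyHostToDevice);

    int* level = d_nodes;                 // start of the current level
    while (count > 1) {
        int* next   = level + count;      // the parent level is stored right after it
        int parents = (count + 1) / 2;
        int block   = 256;
        int grid    = (parents + block - 1) / block;
        // Launches go to the default stream, so the levels run in order.
        build_level<<<grid, block>>>(level, count, next);
        level = next;
        count = parents;
    }

    int root = 0;
    // A blocking cudaMemcpy on the default stream waits for the kernels above.
    cudaMemcpy(&root, level, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_nodes);
    return root;
}
```

For example, tree_root({1, 2, 3, 4, 5}) builds levels of sizes 5, 3, 2 and 1, and the single value on the last level is the sum of all leaves.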

Related

What is wrong with my understanding of "__shared__" variables in cuda?

After reading NVIDIA's manual, I wrote a parallel reduction code as follows:
__global__ void kernel(int *devData)
{
__shared__ int sum;
int i = blockDim.x * blockIdx.x + threadIdx.x;
if (threadIdx.x == 0)
sum = 0;
__syncthreads();
sum += devData[i];
__syncthreads();
if (threadIdx.x == 0)
printf("sum of block %d is %d\n", blockIdx.x, sum);
}
int main(void)
{
// init device
int devIdx = 0;
cudaError_t err = cudaSuccess;
gpuDeviceInit(devIdx);
int i;
int data[100];
int *devData;
for (i = 0; i < 100; i++)
data[i] = 1;
err = cudaMalloc(&devData, 100 * sizeof(int));
checkCudaErrors(err);
// copy data to device
err = cudaMemcpy(devData, data, 100 * sizeof(int), cudaMemcpyHostToDevice);
checkCudaErrors(err);
int blocksPerGrid = 10;
int threadsPerBlock = 10;
// call kernel function
kernel <<<blocksPerGrid, threadsPerBlock>>> (devData);
checkCudaErrors(cudaGetLastError());
cudaDeviceReset();
return 0;
}
I'm trying to sum integers for each block and then print this sum.
But I found the result was as follows:
sum of block 0 is 1
sum of block 6 is 1
sum of block 2 is 1
sum of block 8 is 1
sum of block 1 is 1
sum of block 7 is 1
sum of block 4 is 1
sum of block 3 is 1
sum of block 9 is 1
sum of block 5 is 1
The result I expected was 10. Is the __shared__ variable "sum" shared by every thread in a block? What's wrong with my understanding of "__shared__" variables in CUDA?
You have multiple threads trying to access (read-modify-write) sum at the same time, here:
sum += devData[i];
This doesn't work for either global or shared data in CUDA (i.e. CUDA won't sort that out for you automatically). To sort this out, the usual approaches are either to use atomics or else to use a canonical parallel reduction.
There are numerous questions on both of these topics here on the cuda SO tag, and you can get some focused training on parallel reduction methods in unit 5 of this online training series.
For example, in your code, a trivial change to "fix" would be to replace the above line of code with an atomic add:
atomicAdd(&sum,devData[i]);
Atomics force serialization, so a preferred approach is a canonical parallel reduction.
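To make "canonical parallel reduction" concrete, here is a minimal sketch of the usual shared-memory tree reduction within a block. The names block_sum, sum_per_block and kThreads are illustrative, the block size is assumed to be a power of two (unlike the 10 threads per block in the question), and error checking is left out.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 128;  // assumed block size; must be a power of two here

__global__ void block_sum(const int* in, int n, int* block_sums)
{
    __shared__ int partial[kThreads];

    // Each thread writes its own slot, so there is no read-modify-write race.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < n) ? in[i] : 0;
    __syncthreads();

    // Halve the number of active threads each step; every active thread
    // adds the element "stride" slots away into its own slot.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();  // outside the if, so all threads reach it
    }

    if (threadIdx.x == 0)
        block_sums[blockIdx.x] = partial[0];
}

// Hypothetical host wrapper: returns one partial sum per block.
std::vector<int> sum_per_block(const std::vector<int>& data)
{
    int n = static_cast<int>(data.size());
    int blocks = (n + kThreads - 1) / kThreads;
    std::vector<int> sums(blocks);
    if (n == 0) return sums;

    int* d_in = nullptr;
    int* d_sums = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_sums, blocks * sizeof(int));
    cudaMemcpy(d_in, data.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    block_sum<<<blocks, kThreads>>>(d_in, n, d_sums);

    // The blocking copy waits for the kernel to finish.
    cudaMemcpy(sums.data(), d_sums, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_sums);
    return sums;
}
```

The key difference from the code in the question is that no two threads ever update the same shared location between two __syncthreads() calls, so no atomics are needed.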

CUDA's nvvp reports non-ideal memory access pattern, but bandwidth is almost peaking

EDIT: new minimal working example to illustrate the question and better explanation of nvvp's outcome (following suggestions given in the comments).
So, I have crafted a "minimal" working example, which follows:
#include <cuComplex.h>
#include <iostream>
int const n = 512 * 100;
typedef float real;
template < class T >
struct my_complex {
T x;
T y;
};
__global__ void set( my_complex< real > * a )
{
my_complex< real > & d = a[ blockIdx.x * 1024 + threadIdx.x ];
d = { 1.0f, 0.0f };
}
__global__ void duplicate_whole( my_complex< real > * a )
{
my_complex< real > & d = a[ blockIdx.x * 1024 + threadIdx.x ];
d = { 2.0f * d.x, 2.0f * d.y };
}
__global__ void duplicate_half( real * a )
{
real & d = a[ blockIdx.x * 1024 + threadIdx.x ];
d *= 2.0f;
}
int main()
{
my_complex< real > * a;
cudaMalloc( ( void * * ) & a, sizeof( my_complex< real > ) * n * 1024 );
set<<< n, 1024 >>>( a );
cudaDeviceSynchronize();
duplicate_whole<<< n, 1024 >>>( a );
cudaDeviceSynchronize();
duplicate_half<<< 2 * n, 1024 >>>( reinterpret_cast< real * >( a ) );
cudaDeviceSynchronize();
my_complex< real > * a_h = new my_complex< real >[ n * 1024 ];
cudaMemcpy( a_h, a, sizeof( my_complex< real > ) * n * 1024, cudaMemcpyDeviceToHost );
std::cout << "( " << a_h[ 0 ].x << ", " << a_h[ 0 ].y << " )" << '\t' << "( " << a_h[ n * 1024 - 1 ].x << ", " << a_h[ n * 1024 - 1 ].y << " )" << std::endl;
return 0;
}
When I compile and run the above code, kernels duplicate_whole and duplicate_half take just about the same time to run.
However, when I analyze the kernels using nvvp I get different reports for each of the kernels in the following sense. For kernel duplicate_whole, nvvp warns me that at line 23 (d = { 2.0f * d.x, 2.0f * d.y };) the kernel is performing
Global Load L2 Transaction/Access = 8, Ideal Transaction/Access = 4
I agree that I am loading 8-byte words. What I do not understand is why 4 bytes is the ideal word size. In particular, there is no performance difference between the kernels.
I suppose that there must be circumstances where this global store access pattern could cause performance degradation. What are these?
And why is it that I do not get a performance hit?
I hope that this edit has clarified some unclear points.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
I'll start with some kernel code to exemplify my question, which will follow below:
template < class data_t >
__global__ void chirp_factors_multiply( std::complex< data_t > const * chirp_factors,
std::complex< data_t > * data,
int M,
int row_length,
int b,
int i_0
)
{
#ifndef CUGALE_MUL_SHUFFLE
// Output array length:
int plane_area = row_length * M;
// Process element:
int i = blockIdx.x * row_length + threadIdx.x + i_0;
my_complex< data_t > const chirp_factor = ref_complex( chirp_factors[ i ] );
my_complex< data_t > datum;
my_complex< data_t > datum_new;
for ( int i_b = 0; i_b < b; ++ i_b )
{
my_complex< data_t > & ref_datum = ref_complex( data[ i_b * plane_area + i ] );
datum = ref_datum;
datum_new.x = datum.x * chirp_factor.x - datum.y * chirp_factor.y;
datum_new.y = datum.x * chirp_factor.y + datum.y * chirp_factor.x;
ref_datum = datum_new;
}
#else
// Output array length:
int plane_area = row_length * M;
// Element to process:
int i = blockIdx.x * row_length + ( threadIdx.x + i_0 ) / 2;
my_complex< data_t > const chirp_factor = ref_complex( chirp_factors[ i ] );
// Real and imaginary part of datum (not respectively for odd threads):
data_t datum_a;
data_t datum_b;
// Even TIDs will read data in regular order, odd TIDs will read data in inverted order:
int parity = ( threadIdx.x % 2 );
int shuffle_dir = 1 - 2 * parity;
int inwarp_tid = threadIdx.x % warpSize;
for ( int i_b = 0; i_b < b; ++ i_b )
{
int data_idx = i_b * plane_area + i;
datum_a = reinterpret_cast< data_t * >( data + data_idx )[ parity ];
datum_b = __shfl_sync( 0xFFFFFFFF, datum_a, inwarp_tid + shuffle_dir, warpSize );
// Even TIDs compute real part, odd TIDs compute imaginary part:
reinterpret_cast< data_t * >( data + data_idx )[ parity ] = datum_a * chirp_factor.x - shuffle_dir * datum_b * chirp_factor.y;
}
#endif // #ifndef CUGALE_MUL_SHUFFLE
}
Let us consider the case where data_t is float, which is memory bandwidth limited. As it can be seen above, there are two versions of the kernel, one which reads/writes 8 bytes (a whole complex number) per thread and another which reads/writes 4 bytes per thread and then shuffles the results so the complex product is computed correctly.
The reason why I have written the version using shuffle is because nvvp insisted that reading 8 bytes per thread was not the best idea because this memory access pattern would be inefficient. This is the case even though in both systems tested (GTX 1050 and GTX Titan Xp) memory bandwidth was very close to theoretical maximum.
Surely enough I knew that no improvement was likely to happen, and this was indeed the case: both kernels take pretty much the same time to run. So, my question is the following:
Why is it that nvvp reports that reading 8 bytes would be less efficient than reading 4 bytes per thread? In which circumstances would that be the case?
As a side note, single precision is more important to me, but double is useful in some cases too. Interestingly enough, in the case where data_t is double, there is no execution time difference between the two kernel versions either, even though in this case the kernel is compute bound and the shuffle version performs some more flops than the original version.
Note: the kernels are applied to a row_length * M * b dataset (b images with row_length columns and M lines) and the chirp_factor array is row_length * M. Both kernels run perfectly fine (I can edit the question to show you the calls to both versions if you have doubts about it).
The issue here has to do with how the compiler is processing your code. nvvp is merely dutifully reporting what is happening when you run your code.
If you use the cuobjdump -sass tool on your executable, you will discover that the duplicate_whole routine is doing two 4-byte loads and two 4-byte stores. This is not optimal, partly because there is a stride in each load and store (each load and store touches alternate elements in memory).
The reason for this is that the compiler does not know the alignment of your my_complex struct. Your struct would be legal for use in situations that would prevent the compiler from generating a (legal) 8-byte load. As discussed here we can fix this by informing the compiler that we only intend to use the struct in alignment scenarios where a CUDA 8-byte load is legal (i.e. it is "naturally aligned"). The modification to your struct looks like this:
template < class T >
struct __align__(8) my_complex {
T x;
T y;
};
With that change to your code, the compiler generates 8-byte loads for the duplicate_whole kernel, and you should see a different report from the profiler. You should use this sort of decoration only when you understand what it means and are willing to enter into a contract with the compiler that you will ensure this is the case. If you do something unusual, like unusual pointer casting, you can violate your end of the bargain and generate a machine fault.
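As a small, hedged illustration of that contract, the sketch below repeats the aligned struct under a different, made-up name (aligned_complex), checks its size and alignment at compile time, and uses it in a kernel where each thread copies one whole element. The helper doubled is hypothetical and error checking is omitted. Whether the compiler really emits 8-byte loads and stores still has to be confirmed with cuobjdump -sass.

```cuda
#include <cuda_runtime.h>
#include <vector>

template <class T>
struct __align__(8) aligned_complex {
    T x;
    T y;
};

// Compile-time checks of the promise made to the compiler for T = float.
static_assert(sizeof(aligned_complex<float>) == 8, "two packed floats");
static_assert(alignof(aligned_complex<float>) == 8, "8-byte alignment requested");

__global__ void double_all(aligned_complex<float>* a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        aligned_complex<float> v = a[i];  // whole-struct read; may become one 8-byte load
        v.x *= 2.0f;
        v.y *= 2.0f;
        a[i] = v;                         // whole-struct write; may become one 8-byte store
    }
}

// Hypothetical host wrapper: returns the input with every component doubled.
std::vector<aligned_complex<float>> doubled(std::vector<aligned_complex<float>> v)
{
    int n = static_cast<int>(v.size());
    if (n == 0) return v;

    aligned_complex<float>* d = nullptr;
    size_t bytes = n * sizeof(aligned_complex<float>);
    cudaMalloc(&d, bytes);  // cudaMalloc memory satisfies the 8-byte alignment
    cudaMemcpy(d, v.data(), bytes, cudaMemcpyHostToDevice);

    int block = 256;
    double_all<<<(n + block - 1) / block, block>>>(d, n);

    cudaMemcpy(v.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return v;
}
```

Pointers returned by cudaMalloc, indexed in whole elements, keep the alignment promise. Casting an arbitrary float pointer at an odd element offset to this struct type would break it.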
The reason you don't see much performance difference almost certainly has to do with CUDA load/store behavior and the GPU caches.
When you do a strided load, the GPU loads an entire cacheline anyway, even though (in this case) you only need half the elements (the real elements) for that particular load operation. However you need the other half of the elements (the imaginary elements) anyway; they will be loaded on the next instruction, and this instruction most likely hits in the cache, due to the previous load.
On a strided store in this case, writing strided elements in one instruction and the alternate elements in the next instruction will end up using one of the caches as a "coalescing buffer". This isn't coalescing in the typical sense used in CUDA terminology; that sort of coalescing only applies to a single instruction. However the cache "coalescing buffer" behavior allows it to "accumulate" multiple writes to an already-resident line, before that line gets written out or evicted. This is approximately equivalent to "write-back" cache behavior.

Shared memory, branching performance and register count

I came across some peculiar performance behaviour when trying out the CUDA shuffle instruction. The test kernel below is based on an image processing algorithm which adds input-dependent values to all neighbouring pixels within a square of side rad. The output for each block is added in shared memory. If only one thread per warp adds its result to shared memory, the performance is poor (Option 1), whereas on the other hand, if all threads add to shared memory (one thread adds the desired value, the rest just add 0), the execution time drops by 2-3 times (Option 2).
#include <iostream>
#include "cuda_runtime.h"
#define warpSz 32
#define tileY 32
#define rad 32
__global__ void test(float *out, int pitch)
{
// Set shared mem to 0
__shared__ float tile[(warpSz + 2*rad) * (tileY + 2*rad)];
for (int i = threadIdx.y*blockDim.x+threadIdx.x; i<(tileY+2*rad)*(warpSz+2*rad); i+=blockDim.x*blockDim.y) {
tile[i] = 0.0f;
}
__syncthreads();
for (int row=threadIdx.y; row<tileY; row += blockDim.y) {
// Loop over pixels in neighbourhood
for (int i=0; i<2*rad+1; ++i) {
float res = 0.0f;
int rowStartIdx = (row+i)*(warpSz+2*rad);
for (int j=0; j<2*rad+1; ++j) {
res += float(threadIdx.x+row); // Substitute for real calculation
// Option 1: one thread writes to shared mem
if (threadIdx.x == 0) {
tile[rowStartIdx + j] += res;
res = 0.0f;
}
//// Option 2: all threads write to shared mem
//float tmp = 0.0f;
//if (threadIdx.x == 0) {
// tmp = res;
// res = 0.0f;
//}
//tile[rowStartIdx + threadIdx.x+j] += tmp;
res = __shfl(res, (threadIdx.x+1) % warpSz);
}
res += float(threadIdx.x+row);
tile[rowStartIdx + threadIdx.x+2*rad] += res;
__syncthreads();
}
}
// Add result back to global mem
for (int row=threadIdx.y; row<tileY+2*rad; row+=blockDim.y) {
for (int col=threadIdx.x; col<warpSz+2*rad; col+=warpSz) {
int idx = (blockIdx.y*tileY + row)*pitch + blockIdx.x*warpSz + col;
atomicAdd(out+idx, tile[row*(warpSz+2*rad) + col]);
}
}
}
int main(void)
{
int2 dim = make_int2(512, 512);
int pitchOut = (((dim.x+2*rad)+warpSz-1) / warpSz) * warpSz;
int sizeOut = pitchOut*(dim.y+2*rad);
dim3 gridDim((dim.x+warpSz-1)/warpSz, (dim.y+tileY-1)/tileY, 1);
float *devOut;
cudaMalloc((void**)&devOut, sizeOut*sizeof(float));
cudaEvent_t start, stop;
float elapsedTime;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaFree(0);
cudaEventRecord(start, 0);
test<<<gridDim, dim3(warpSz, 8)>>>(devOut, pitchOut);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsedTime, start, stop);
cudaFree(devOut);
cudaDeviceReset();
std::cout << "Elapsed time: " << elapsedTime << " ms.\n";
std::cin.ignore();
}
Is this expected behaviour/can anyone explain why this happens?
One thing I have noted is that Option 1 uses only 15 registers, whereas Option 2 uses 37, which seems a big difference to me.
Another is that the if-statement in the innermost loop is converted to explicit bra instructions in the PTX code for Option 1, whereas for Option 2 it is converted to two selp instructions. Could it be that the explicit branching is behind the 2-3 times slow down similar to what's suspected in this question?
There are two reasons why I am reluctant to go for Option 2. First, when profiling the original application it seems to be limited by shared memory bandwidth, which indicates that there is potential to increase the performance by having fewer threads accessing it. Second, unless we use the volatile keyword, writes to shared memory can be optimised to registers. Since we are only interested in the contribution from the last thread to access each memory location (threadIdx.x == 0), and all others add 0, this is not a problem as long as all changes temporarily located in registers are guaranteed to be written back to shared memory in the same order they were issued. Is this the case though? (Thus far, both options have produced the exact same result.)
Any thoughts or ideas are much appreciated!
PS. I compile for compute capability 3.0. (However, the shuffle instruction is not necessary to demonstrate the behaviour and can be commented out.)

CUDA not so fast against CPU with OpenMP?

I am trying to compute cross-correlation amongst 450 vectors each of size 20000.
While doing this on the CPU I stored the data in a 2D matrix with rows=20000 and cols=450.
The CPU code for the computation looks like
void computeFF_cpu( int nSamples, int nFeatures, float ** data, float ** corr )
{
#pragma omp parallel for shared(corr, data)
for( int i=0 ; i<nFeatures ; i++ )
{
for( int j=0 ; j<nFeatures ; j++ )
corr[i][j] = pearsonCorr( data[i], data[j], nSamples );
}
}
int main()
{
.
.
for( int z=0 ; z<1000 ; z++ )
computeFF_cpu( 20000, 450, data, corr );
.
.
}
This works perfectly. Now I have attempted to solve this problem with GPU. I have converted the 2D data matrix into row-major format in GPU memory and I have verified that the copy is correctly made.
The vectors are stored as a matrix of size 900000 (ie. 450*20000) in row major format. Organized as follows
<---nSamples of f1---><---nSamples of f2 ---><---nSamples of f3--->......
My cuda code to compute cross-correlation is as follows
// kernel for computation of ff
__global__ void computeFFCorr(int nSamples, int nFeatures, float * dev_data, float * dev_ff)
{
int tid = blockIdx.x + blockIdx.y*gridDim.x;
if( blockIdx.x == blockIdx.y )
dev_ff[tid] = 1.0;
else if( tid < nFeatures*nFeatures )
dev_ff[tid] = pearsonCorrelationScore_gpu( dev_data+(blockIdx.x*nSamples), dev_data+(blockIdx.y*nSamples), nSamples );
}
int main()
{
.
.
// Call kernel for computation of ff
for( int z=0 ; z<1000 ; z++ )
computeFFCorr<<<dim3(nFeatures,nFeatures),1>>>(nSamples, nFeatures, dev_data, corr);
//nSamples = 20000
// nFeatures = 450
// dev_data -> data matrix in row major form
// corr -> result matrix also stored in row major
.
.
}
It seems I have found an answer to my own question. I ran the following experiment: I varied z (i.e. the number of times the function is executed). This kind of approach was suggested in several previous posts on Stack Overflow under the cuda tag.
Here is the table:
Z=100 ; CPU=11s ; GPU=14s
Z=200 ; CPU=18s ; GPU=23s
Z=300 ; CPU=26s ; GPU=34s
Z=500 ; CPU=41s ; GPU=53s
Z=1000; CPU=99s ; GPU=101s
Z=1500; CPU=279s; GPU=150s
Z=2000; CPU=401s; GPU=203s
It is evident that as the amount of computation grows, the GPU scales much better than the CPU.

CUDA kernel call in a simple sample

This is the first parallel code in CUDA by Example.
Can anyone explain the kernel call <<< N , 1 >>> to me?
This is the code, with the important points:
#define N 10
__global__ void add( int *a, int *b, int *c ) {
int tid = blockIdx.x; // this thread handles the data at its thread id
if (tid < N)
c[tid] = a[tid] + b[tid];
}
int main( void ) {
int a[N], b[N], c[N];
int *dev_a, *dev_b, *dev_c;
// allocate the memory on the GPU
// fill the arrays 'a' and 'b' on the CPU
// copy the arrays 'a' and 'b' to the GPU
add<<<N,1>>>( dev_a, dev_b, dev_c );
// copy the array 'c' back from the GPU to the CPU
// display the results
// free the memory allocated on the GPU
return 0;
}
Why does it use <<< N , 1 >>>, which means N blocks with 1 thread in each block? We could instead write <<< 1 , N >>> and use 1 block with N threads, which would seem more optimized.
For this little example, there is no particular reason (as Bart already told you in the comments). But for a larger, more realistic example you should always keep in mind that the number of threads per block is limited (to 1024 on current GPUs). That is, if you use N = 10000, you could not use <<<1,N>>> anymore, but <<<N,1>>> would still work.
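In practice the two launch styles are usually combined: several blocks, each with many threads, and a global index computed from both. The sketch below is one way to write that, not the book's code. The wrapper add_on_gpu and the block size of 256 are assumptions made for the example, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

__global__ void add(const int* a, const int* b, int* c, int n)
{
    // Global index: block offset plus position inside the block.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)  // the last block may have threads past the end
        c[tid] = a[tid] + b[tid];
}

// Hypothetical host wrapper; assumes a and b have the same length.
std::vector<int> add_on_gpu(const std::vector<int>& a, const std::vector<int>& b)
{
    int n = static_cast<int>(a.size());
    std::vector<int> c(n);
    if (n == 0) return c;

    size_t bytes = n * sizeof(int);
    int *dev_a = nullptr, *dev_b = nullptr, *dev_c = nullptr;
    cudaMalloc(&dev_a, bytes);
    cudaMalloc(&dev_b, bytes);
    cudaMalloc(&dev_c, bytes);
    cudaMemcpy(dev_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;                         // well below the per-block limit
    int blocks  = (n + threads - 1) / threads; // round up to cover all n elements
    add<<<blocks, threads>>>(dev_a, dev_b, dev_c, n);

    // The blocking copy waits for the kernel to finish.
    cudaMemcpy(c.data(), dev_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return c;
}
```

With n = 10000 this launches 40 blocks of 256 threads, so it works where <<<1,N>>> would not and keeps more threads per block busy than <<<N,1>>>.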