CUDA: streaming the same memory location to all threads

Here's my problem: I have quite a big set of doubles (an array of 77,500 doubles) to be stored somewhere on the CUDA device. I need a big set of threads to sequentially perform a bunch of operations with that array. Every thread has to read the SAME element of the array, perform its tasks, store the results in shared memory, and then read the next element. Note that all threads read (and only read) from the same memory location simultaneously. So I wonder: is there any way to broadcast the same double to all threads with just one memory read? Reading it many times seems wasteful. Any ideas?

This is a common optimization. The idea is to make each thread cooperate with its blockmates to read in the data:
// choose some reasonable block size
const unsigned int block_size = 256;
__global__ void kernel(double *ptr)
{
    __shared__ double window[block_size];
    // cooperate with my block to load block_size elements
    window[threadIdx.x] = ptr[threadIdx.x];
    // wait until the window is full
    __syncthreads();
    // operate on the data
    ...
}
You can iteratively "slide" the window across the array, block_size elements (or some integer multiple of that) at a time, to consume the whole thing. The same technique applies when you'd like to store the data back in a synchronized fashion.
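A minimal sketch of the sliding window described above, under some assumptions of mine: the per-element work is a placeholder weighted sum, and the names `consume_all` and `run_sliding_window` are hypothetical. Each block reads every array element from global memory once; inside the inner loop all threads read the same `window[k]`, which shared memory serves as a broadcast.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int BLOCK_SIZE = 256;  // same block size as the snippet above

// Every thread consumes every element of data[], one element at a time.
// The "work" per element (a weighted running sum) is only a placeholder.
__global__ void consume_all(const double *data, int n, double *out)
{
    __shared__ double window[BLOCK_SIZE];
    const int gid = blockIdx.x * blockDim.x + threadIdx.x;
    double acc = 0.0;

    // Slide the window across the array, BLOCK_SIZE elements at a time.
    for (int base = 0; base < n; base += BLOCK_SIZE) {
        const int idx = base + threadIdx.x;
        // Cooperative, coalesced load; the last partial window is zero-padded.
        window[threadIdx.x] = (idx < n) ? data[idx] : 0.0;
        __syncthreads();  // wait until the window is full

        // All threads read the same window[k] in the same step.
        for (int k = 0; k < BLOCK_SIZE; ++k)
            acc += window[k] * (gid + 1);
        __syncthreads();  // everyone is done before the window is overwritten
    }
    out[gid] = acc;
}

// Host driver (hypothetical): true if every thread saw every element once.
bool run_sliding_window(int n, int blocks)
{
    std::vector<double> h_in(n);
    double sum = 0.0;
    for (int i = 0; i < n; ++i) { h_in[i] = i % 7; sum += h_in[i]; }

    const int threads = blocks * BLOCK_SIZE;
    std::vector<double> h_out(threads, -1.0);
    double *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(double));
    cudaMalloc(&d_out, threads * sizeof(double));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(double), cudaMemcpyHostToDevice);

    consume_all<<<blocks, BLOCK_SIZE>>>(d_in, n, d_out);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, threads * sizeof(double),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    bool ok = (err == cudaSuccess);
    for (int t = 0; t < threads; ++t)
        ok = ok && (h_out[t] == sum * (t + 1));
    return ok;
}
```

The kernel assumes it is launched with exactly `BLOCK_SIZE` threads per block. The test data are small integers, so the double-precision sums are exact and can be compared with `==`.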

Related

How can I make sure the compiler parallelizes my loads from global memory?

I've written a CUDA kernel that looks something like this:
int tIdx = threadIdx.x; // Assume a 1-D thread block and a 1-D grid
int buffNo = 0;
for (int offset=buffSz*blockIdx.x; offset<totalCount; offset+=buffSz*gridDim.x) {
    // Select which "page" we're using on this iteration
    float *buff = &sharedMem[buffNo*buffSz];
    // Load data from global memory
    if (tIdx < nLoadThreads) {
        for (int ii=tIdx; ii<buffSz; ii+=nLoadThreads)
            buff[ii] = globalMem[ii+offset];
    }
    // Wait for shared memory
    __syncthreads();
    // Perform computation
    if (tIdx >= nLoadThreads) {
        // Perform some computation on the contents of buff[]
    }
    // Switch pages
    buffNo ^= 0x01;
}
Note that there's only one __syncthreads() in the loop, so the first nLoadThreads threads will start loading the data for the 2nd iteration while the rest of the threads are still computing the results for the 1st iteration.
I was thinking about how many threads to allocate for loading vs. computing, and I reasoned that I would only need a single warp for loading, regardless of buffer size, because that inner for loop consists of independent loads from global memory: they can all be in flight at the same time. Is this a valid line of reasoning?
And yet when I try this out, I find that (1) increasing the # of load warps dramatically increases performance, and (2) the disassembly in nvvp shows that buff[ii] = globalMem[ii+offset] was compiled into a load from global memory followed 2 instructions later by a store to shared memory, indicating that the compiler is not applying instruction-level parallelism here.
Would additional qualifiers (const, __restrict__, etc) on buff or globalMem help ensure the compiler does what I want?
I suspect the problem has to do with the fact that buffSz is not known at compile-time (the actual data is 2-D and the appropriate buffer size depends on the matrix dimensions). In order to do what I want, the compiler will need to allocate a separate register for each LD operation in flight, right? If I manually unroll the loop, the compiler re-orders the instructions so that there are a few LD in flight before the corresponding ST needs to access that register. I tried a #pragma unroll but the compiler only unrolled the loop without reordering the instructions, so that didn't help. What else can I do?
The compiler has no chance to reorder stores to shared memory away from loads from global memory, because a __syncthreads() barrier is immediately following.
As all of the threads have to wait at the barrier anyway, it is faster to use more threads for loading. This means that more global memory transactions can be in flight at any time, and each load thread incurs the global memory latency less often.
No CUDA device so far supports out-of-order execution, so the load loop will incur exactly one global memory latency per loop iteration, unless the compiler can unroll it and reorder the loads before the stores.
To allow full unrolling, the number of loop iterations needs to be known at compile time. You can use talonmies' suggestion of templating the loop trips to achieve this.
You can also use partial unrolling. Annotating the load loop with #pragma unroll 2 will allow the compiler to issue two loads, then two stores for every two loop iterations, thus achieving an effect similar to doubling nLoadThreads. Replacing 2 with higher numbers is possible, but you will hit the maximum number of transactions in flight at some point (use float2 or float4 moves to transfer more data with the same number of transactions). It is also difficult to predict whether the compiler will prefer reordering instructions over the cost of more complex code for the final, potentially partial, trip through the unrolled loop.
So the suggestions are:
Use as many load threads as possible.
Unroll the load loop by templating the number of loop iterations and instantiating it for all possible number of loop trips (or the most common ones, with a generic fallback), or by using partial loop unrolling.
If the data is suitably aligned, move it as float2 or float4 to move more data with the same number of transactions.
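A sketch of the templating suggestion, under simplifying assumptions of mine: every thread in the block loads, the tile size is exactly `TRIPS * blockDim.x`, and only trip counts 1, 2 and 4 are instantiated. The names `load_tile`, `tile_sums` and `run_tile_sums` are hypothetical. Writing the loads into a register array first and storing afterwards expresses the intended ordering, although the compiler remains free to schedule the instructions as it sees fit.

```cuda
#include <vector>
#include <cuda_runtime.h>

// TRIPS is known at compile time, so both loops can be fully unrolled and
// the TRIPS independent loads can all be issued before the first store.
template <int TRIPS>
__device__ void load_tile(float *buff, const float *__restrict__ src)
{
    float regs[TRIPS];
    #pragma unroll
    for (int t = 0; t < TRIPS; ++t)   // issue all global loads first
        regs[t] = src[t * blockDim.x + threadIdx.x];
    #pragma unroll
    for (int t = 0; t < TRIPS; ++t)   // then all shared-memory stores
        buff[t * blockDim.x + threadIdx.x] = regs[t];
}

// Placeholder computation: one sum per tile, just to check the copy.
template <int TRIPS>
__global__ void tile_sums(const float *globalMem, float *blockSums)
{
    extern __shared__ float buff[];
    const int tileSz = TRIPS * blockDim.x;
    load_tile<TRIPS>(buff, globalMem + blockIdx.x * tileSz);
    __syncthreads();
    if (threadIdx.x == 0) {
        float s = 0.0f;
        for (int i = 0; i < tileSz; ++i) s += buff[i];
        blockSums[blockIdx.x] = s;
    }
}

// Host driver (hypothetical): picks the instantiation at run time.
bool run_tile_sums(int trips)
{
    const int threads = 128, blocks = 4;
    const int tileSz = trips * threads, total = tileSz * blocks;
    std::vector<float> h_in(total), h_out(blocks, -1.0f);
    for (int i = 0; i < total; ++i) h_in[i] = float(i % 3);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, total * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), total * sizeof(float), cudaMemcpyHostToDevice);

    const size_t shmem = tileSz * sizeof(float);
    switch (trips) {  // one kernel instance per supported trip count
    case 1: tile_sums<1><<<blocks, threads, shmem>>>(d_in, d_out); break;
    case 2: tile_sums<2><<<blocks, threads, shmem>>>(d_in, d_out); break;
    case 4: tile_sums<4><<<blocks, threads, shmem>>>(d_in, d_out); break;
    default: cudaFree(d_in); cudaFree(d_out); return false;  // no instance
    }
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, blocks * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    bool ok = (err == cudaSuccess);
    for (int b = 0; b < blocks; ++b) {
        float expected = 0.0f;
        for (int i = 0; i < tileSz; ++i) expected += h_in[b * tileSz + i];
        ok = ok && (h_out[b] == expected);
    }
    return ok;
}
```

A real kernel would add a generic fallback for trip counts without an instance, as suggested above, and would keep the double-buffering loop from the question around the load.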

Benefit of splitting a big CUDA kernel and using dynamic parallelism

I have a big kernel in which an initial state is evolved using different techniques. That is, I have a loop in the kernel, in this loop a certain predicate is evaluated on the current state and on the result of this predicate, a certain action is taken.
The kernel needs a bit of temporary data and shared memory, but since it is big it uses 63 registers and the occupancy is very very low.
I would like to split the kernel in many little kernels, but every block is totally independent from the others and I (think I) can't use a single thread on the host code to launch multiple small kernels.
I am not sure whether streams are adequate for this kind of work, as I have never used them. Since I have the option of using dynamic parallelism, I would like to know whether that is a good way to implement this kind of job.
Is it fast to launch a kernel from a kernel?
Do I need to copy data in global memory to make them available to a sub-kernel?
If I split my big kernel into many small ones and leave the first kernel with a main loop that calls the required kernel when necessary (which lets me move the temporary variables into each sub-kernel), will that help me increase the occupancy?
I know this is a rather generic question, but I do not know this technology and I would like to know whether it fits my case or whether streams are better.
EDIT:
To provide some other details, you can imagine my kernel to have this kind of structure:
__global__ void kernel(int *sampleData, int *initialData) {
    __shared__ int systemState[N];
    __shared__ int someTemp[N * 3];
    __shared__ int time;
    int tid = ...;
    systemState[tid] = initialData[tid];
    while (time < TIME_END) {
        bool c = calc_something(systemState);
        if (c)
            break;
        someTemp[tid] = do_something(systemState);
        c = do_check(someTemp);
        if (__syncthreads_or(c))
            break;
        sample(sampleData, systemState);
        if (__syncthreads_and(...)) {
            do_something(systemState);
            sync();
            time += some_increment(systemState);
        }
        else {
            calcNewTemp(someTemp, systemState);
            sync();
            do_something_else(someTemp, systemState);
            time += some_other_increment(someTemp, systemState);
        }
    }
    do_some_stats();
}
This is to show you that there is a main loop, that there is temporary data which is used in some places and not in others, and that there are shared data, synchronization points, etc.
Threads are used to compute vectorial data, while there is, ideally, one single loop in each block (well, of course it is not true, but logically it is)... One "big flow" for each block.
Now, I am not sure about how to use streams in this case... Where is the "big loop"? On the host I guess... But how do I coordinate, from a single loop, all the blocks? This is what leaves me most dubious. May I use streams from different host threads (One thread per block)?
I am less dubious about dynamic parallelism, because I could easily keep the big loop running, but I am not sure if I could have advantages here.
I have benefitted from dynamic parallelism for solving an interpolation problem of the form:
int i = threadIdx.x + blockDim.x * blockIdx.x;
for(int m=0; m<(2*K+1); m++) {
PP1 = calculate_PP1(i,m);
phi_cap1 = calculate_phi_cap1(i,m);
for(int n=0; n<(2*K+1); n++) {
PP2 = calculate_PP2(i,m);
phi_cap2 = calculate_phi_cap2(i,n);
atomicAdd(&result[PP1][PP2],data[i]*phi_cap1*phi_cap2); } } }
where K=6. In this interpolation problem, the computation of each addend is independent of the others, so I have split them in a (2K+1)x(2K+1) kernel.
From my (possibly incomplete) experience, dynamic parallelism helps if you have a small number of independent iterations. With a larger number of iterations you may end up calling the child kernel many times, so you should check whether the kernel launch overhead becomes the limiting factor.

Very poor memory access performance with CUDA

I'm very new to CUDA, and trying to write a test program.
I'm running the application on GeForce GT 520 card, and get VERY poor performance.
The application is used to process some image, with each row being handled by a separate thread.
Below is a simplified version of the application. Please note that in the real application, all constants are actually variables, provided by the caller.
When running the code below, it takes more than 20 seconds to complete the execution.
But as opposed to using malloc/free, when l_SrcIntegral is defined as a local array (as it appears in the commented line), it takes less than 1 second to complete the execution.
Since the actual size of the array is dynamic (and not 1700), this local array can't be used in the real application.
Any advice how to improve the performance of this rather simple code would be appreciated.
#include "cuda_runtime.h"
#include <stdio.h>
#define d_MaxParallelRows 320
#define d_MinTreatedRow 5
#define d_MaxTreatedRow 915
#define d_RowsResolution 1
#define k_ThreadsPerBlock 64
__global__ void myKernel(int Xi_FirstTreatedRow)
{
    int l_ThreadIndex = blockDim.x * blockIdx.x + threadIdx.x;
    if (l_ThreadIndex >= d_MaxParallelRows)
        return;
    int l_Row = Xi_FirstTreatedRow + (l_ThreadIndex * d_RowsResolution);
    if (l_Row <= d_MaxTreatedRow) {
        //float l_SrcIntegral[1700];
        float* l_SrcIntegral = (float*)malloc(1700 * sizeof(float));
        for (int x=185; x<1407; x++) {
            for (int i=0; i<1700; i++)
                l_SrcIntegral[i] = i;
        }
        free(l_SrcIntegral);
    }
}
int main()
{
    cudaError_t cudaStatus;
    cudaStatus = cudaSetDevice(0);
    int l_ThreadsPerBlock = k_ThreadsPerBlock;
    int l_BlocksPerGrid = (d_MaxParallelRows + l_ThreadsPerBlock - 1) / l_ThreadsPerBlock;
    int l_FirstRow = d_MinTreatedRow;
    while (l_FirstRow <= d_MaxTreatedRow) {
        printf("CUDA: FirstRow=%d\n", l_FirstRow);
        fflush(stdout);
        myKernel<<<l_BlocksPerGrid, l_ThreadsPerBlock>>>(l_FirstRow);
        cudaDeviceSynchronize();
        l_FirstRow += (d_MaxParallelRows * d_RowsResolution);
    }
    printf("CUDA: Done\n");
    return 0;
}
1.
As @aland said, you may encounter even worse performance when calculating just one row in each kernel call.
You have to think about processing the whole input at once to use, even in theory, the power of massively parallel processing.
Why start multiple kernels with just 320 threads each just to calculate one row?
How about using as many blocks as you have rows and letting the threads of each block process one row?
(320 threads per block is not a good choice; check out how to reach better occupancy.)
2.
If your fast resources, such as registers and shared memory, are not enough, you have to use a tile approach, which is one of the basics of GPGPU programming.
Separate the input data into tiles of equal size and process them in a loop in your thread.
Here I posted an example of such a tile approach:
Parallelization in CUDA, assigning threads to each column
Be aware of range checks in that tile approach!
Example to give you the idea:
Calculate the sum of all elements in a column vector in an arbitrary sized matrix.
Each block processes one column, and in a tile loop the threads of that block store their elements in a shared memory array. Once a tile is loaded, they calculate its sum using parallel reduction and then start the next iteration.
At the end, each block has calculated the sum of its column.
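The column-sum idea above could look roughly like this. It is a sketch that assumes a row-major matrix and a power-of-two block size; the names `column_sums` and `run_column_sums` are made up for illustration.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int THREADS = 128;  // tile size; must be a power of two here

// One block per column. The column is consumed in tiles of THREADS rows;
// each tile is reduced in shared memory and added to a running total.
__global__ void column_sums(const float *m, int rows, int cols, float *sums)
{
    __shared__ float tile[THREADS];
    const int col = blockIdx.x;
    float total = 0.0f;  // only thread 0's copy is used

    for (int base = 0; base < rows; base += THREADS) {
        const int row = base + threadIdx.x;
        // Range check: rows past the end contribute zero.
        tile[threadIdx.x] = (row < rows) ? m[row * cols + col] : 0.0f;
        __syncthreads();

        // Parallel reduction of the tile.
        for (int s = THREADS / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                tile[threadIdx.x] += tile[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            total += tile[0];
    }
    if (threadIdx.x == 0)
        sums[col] = total;
}

// Host driver (hypothetical): fills column c with the value c + 1.
bool run_column_sums(int rows, int cols)
{
    std::vector<float> h_m(rows * cols), h_sums(cols, -1.0f);
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            h_m[r * cols + c] = float(c + 1);

    float *d_m = nullptr, *d_sums = nullptr;
    cudaMalloc(&d_m, h_m.size() * sizeof(float));
    cudaMalloc(&d_sums, cols * sizeof(float));
    cudaMemcpy(d_m, h_m.data(), h_m.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    column_sums<<<cols, THREADS>>>(d_m, rows, cols, d_sums);
    cudaError_t err = cudaMemcpy(h_sums.data(), d_sums, cols * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_m);
    cudaFree(d_sums);

    bool ok = (err == cudaSuccess);
    for (int c = 0; c < cols; ++c)
        ok = ok && (h_sums[c] == float(rows * (c + 1)));
    return ok;
}
```

Walking down a column of a row-major matrix is not a coalesced access pattern, so this sketch only illustrates the tile loop, the range check and the reduction. A tuned version would change the layout or assign work differently.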
You can still use dynamic array sizes using shared memory. Just pass a third argument in the <<<...>>> of the kernel call. That'd be the size of your shared memory per block.
Once you're there, just bring all relevant data into your shared array (you should still try to keep coalesced accesses) bringing one or several (if it's relevant to keep coalesced accesses) elements per thread. Sync threads after it's been brought (only if you need to stop race conditions, to make sure the whole array is in shared memory before any computation is done) and you're good to go.
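As an illustration of the third launch argument, here is a hedged sketch in which each block stages one image row in dynamically sized shared memory. The row reversal is a stand-in for the real per-row work, and `reverse_rows` and `run_reverse_rows` are names I made up.

```cuda
#include <vector>
#include <cuda_runtime.h>

// One block per row. The shared array has no compile-time size; its size in
// bytes comes from the third argument inside <<<...>>>.
__global__ void reverse_rows(const float *in, float *out, int width)
{
    extern __shared__ float s_row[];
    const float *src = in + blockIdx.x * width;
    float *dst = out + blockIdx.x * width;

    // Coalesced cooperative load of the whole row into shared memory.
    for (int x = threadIdx.x; x < width; x += blockDim.x)
        s_row[x] = src[x];
    __syncthreads();  // the whole row must be present before it is used

    // Placeholder work that needs the whole row: write it out reversed.
    for (int x = threadIdx.x; x < width; x += blockDim.x)
        dst[x] = s_row[width - 1 - x];
}

// Host driver (hypothetical): width is only known at run time.
bool run_reverse_rows(int rows, int width)
{
    const int n = rows * width;
    std::vector<float> h_in(n), h_out(n, -1.0f);
    for (int i = 0; i < n; ++i) h_in[i] = float(i);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const size_t shmemBytes = width * sizeof(float);  // per-block shared size
    reverse_rows<<<rows, 64, shmemBytes>>>(d_in, d_out, width);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    bool ok = (err == cudaSuccess);
    for (int r = 0; r < rows; ++r)
        for (int x = 0; x < width; ++x)
            ok = ok && (h_out[r * width + x] == h_in[r * width + width - 1 - x]);
    return ok;
}
```

With a width of 1700 floats the request is 6800 bytes per block, which fits within the default shared memory limit. Larger rows would need the tile approach.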
Also: you should tessellate using blocks and threads, not loops. I understand that's just an example using a local array, but still, it could be done tessellating through blocks/threads and not nested for loops (which are VERY bad for performance!) I hope you're running your sample code using just 1 block and 1 thread, otherwise it wouldn't make much sense.

How to share a common value between threads in a given block?

I have a kernel that, for each thread in a given block, computes a for loop with a different number of iterations. I use a buffer of size N_BLOCKS to store the number of iterations required for each block. Hence, each thread in a given block must know the number of iterations specific to its block.
However, I'm not sure which way is the best (performance speaking) to read the value and distribute it to all the other threads. I see only one good way (please tell me if there is something better): store the value in shared memory and have each thread read it. For example:
__global__ void foo( int* nIterBuf )
{
    __shared__ int nIter;
    if( threadIdx.x == 0 )
        nIter = nIterBuf[blockIdx.x];
    __syncthreads();
    for( int i=0; i < nIter; i++ )
        ...
}
Any other better solutions? My app will use a lot of data, so I want the best performance.
Thanks!
Read-only values that are uniform across all threads in a block are probably best stored in __constant__ arrays. On some CUDA architectures such as Fermi (SM 2.x), if you declare the array or pointer argument using the C++ const keyword AND you access it uniformly within the block (i.e. the index only depends on blockIdx, not threadIdx), then the compiler may automatically promote the reference to constant memory.
The advantage of constant memory is that it goes through a dedicated cache, so it doesn't pollute the L1. If the amount of data you are accessing per block is relatively small, then after the initial compulsory miss in each thread block, subsequent accesses should always hit in the cache.
You also won't need to use any shared memory or transfer from global to shared memory.
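A minimal sketch of the `__constant__` variant, assuming the number of blocks has a known upper bound that fits in constant memory. `MAX_BLOCKS`, `c_nIter`, `foo_const` and `run_foo_const` are hypothetical names, and the loop body is a placeholder.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int MAX_BLOCKS = 1024;       // assumed upper bound (4 KB of ints)
__constant__ int c_nIter[MAX_BLOCKS];  // per-block iteration counts

__global__ void foo_const(int *out)
{
    // The index depends only on blockIdx, so the access is uniform within
    // the block and is served by the constant cache.
    const int nIter = c_nIter[blockIdx.x];

    int acc = 0;
    for (int i = 0; i < nIter; ++i)    // placeholder loop body
        acc += i + threadIdx.x;
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

// Host driver (hypothetical): block b runs b + 1 iterations.
bool run_foo_const(int nBlocks)
{
    if (nBlocks > MAX_BLOCKS) return false;
    const int threads = 32;
    std::vector<int> h_nIter(nBlocks), h_out(nBlocks * threads, -1);
    for (int b = 0; b < nBlocks; ++b) h_nIter[b] = b + 1;

    // Copy the counts into the __constant__ array.
    cudaMemcpyToSymbol(c_nIter, h_nIter.data(), nBlocks * sizeof(int));

    int *d_out = nullptr;
    cudaMalloc(&d_out, h_out.size() * sizeof(int));
    foo_const<<<nBlocks, threads>>>(d_out);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out,
                                 h_out.size() * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    bool ok = (err == cudaSuccess);
    for (int b = 0; b < nBlocks; ++b)
        for (int t = 0; t < threads; ++t) {
            const int n = h_nIter[b];
            ok = ok && (h_out[b * threads + t] == n * (n - 1) / 2 + n * t);
        }
    return ok;
}
```

Compared with the shared-memory version in the question, this needs neither the `__shared__` variable nor the `__syncthreads()`; each thread simply keeps its copy of `nIter` in a register.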
If my info is up-to-date, the shared memory is the second fastest memory, second only to the registers.
If reading this data from shared memory every iteration slows you down and you still have registers available (refer to your GPU's compute capability and specs), you could perhaps try to store a copy of this value in every thread's register (using a local variable).

Copying whole global memory buffer many times to shared memory buffer

I have a buffer in global memory that I want to copy into shared memory for each block, so as to speed up my read-only access. Each thread in each block will use the whole buffer at different positions concurrently.
How does one do that?
I know the size of the buffer only at run time:
__global__ void foo( int *globalMemArray, int N )
{
    extern __shared__ int s_array[];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if( idx < N )
    {
        ...?
    }
}
The first point to make is that shared memory is limited to a maximum of either 16 KB or 48 KB per streaming multiprocessor (SM), depending on which GPU you are using and how it is configured, so unless your global memory buffer is very small, you will not be able to load all of it into shared memory at the same time.
The second point to make is that the contents of shared memory only have the scope and lifetime of the block they are associated with. Your sample kernel only has a single global memory argument, which makes me think that you are either under the misapprehension that the contents of a shared memory allocation can be preserved beyond the life span of the block that filled it, or that you intend to write the results of the block calculations back into the same global memory array from which the input data was read. The first possibility is wrong and the second will result in memory races and inconsistent results. It is probably better to think of shared memory as a small, block-scope L1 cache which is fully programmer managed than as some sort of faster version of global memory.
With those points out of the way, a kernel which loaded successive segments of a large input array, processed them, and then wrote some per-thread final result back to global memory might look something like this:
template <int blocksize>
__global__ void foo( int *globalMemArray, int *globalMemOutput, int N )
{
    __shared__ int s_array[blocksize];
    int npasses = (N / blocksize) + (((N % blocksize) > 0) ? 1 : 0);
    for(int pos = threadIdx.x; pos < (blocksize*npasses); pos += blocksize) {
        if( pos < N ) {
            s_array[threadIdx.x] = globalMemArray[pos];
        }
        __syncthreads();
        // Calculations using partial buffer contents
        .......
        __syncthreads();
    }
    // write final per thread result to output
    globalMemOutput[threadIdx.x + blockIdx.x*blockDim.x] = .....;
}
In this case I have specified the shared memory array size as a template parameter, because it isn't really necessary to dynamically allocate the shared memory array size at runtime, and the compiler has a better chance at performing optimizations when the shared memory array size is known at compile time (perhaps in the worst case there could be selection between different kernel instances done at run time).
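The run-time selection between kernel instances mentioned above might be sketched as follows. The per-thread calculation (counting how many buffer elements equal a per-thread target) is a placeholder of mine, and `count_matches`, `launch_count` and `run_count_matches` are hypothetical names.

```cuda
#include <vector>
#include <cuda_runtime.h>

template <int blocksize>
__global__ void count_matches(const int *globalMemArray, int *globalMemOutput, int N)
{
    __shared__ int s_array[blocksize];
    const int gid = threadIdx.x + blockIdx.x * blockDim.x;
    const int target = gid % 4;  // placeholder per-thread parameter
    int count = 0;

    for (int base = 0; base < N; base += blocksize) {
        const int pos = base + threadIdx.x;
        if (pos < N)
            s_array[threadIdx.x] = globalMemArray[pos];
        __syncthreads();
        // Only the filled part of the last, possibly partial, segment is used.
        const int valid = min(blocksize, N - base);
        for (int k = 0; k < valid; ++k)
            count += (s_array[k] == target);
        __syncthreads();
    }
    globalMemOutput[gid] = count;
}

// Select a kernel instance at run time; false if none matches.
bool launch_count(int blocksize, int blocks, const int *in, int *out, int N)
{
    switch (blocksize) {
    case 128: count_matches<128><<<blocks, 128>>>(in, out, N); return true;
    case 256: count_matches<256><<<blocks, 256>>>(in, out, N); return true;
    default:  return false;
    }
}

// Host driver (hypothetical): the buffer holds i % 4 at position i.
bool run_count_matches(int blocksize, int N)
{
    const int blocks = 2, threads = blocks * blocksize;
    std::vector<int> h_in(N), h_out(threads, -1);
    for (int i = 0; i < N; ++i) h_in[i] = i % 4;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, N * sizeof(int));
    cudaMalloc(&d_out, threads * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), N * sizeof(int), cudaMemcpyHostToDevice);

    bool ok = launch_count(blocksize, blocks, d_in, d_out, N);
    if (ok)
        ok = (cudaMemcpy(h_out.data(), d_out, threads * sizeof(int),
                         cudaMemcpyDeviceToHost) == cudaSuccess);
    cudaFree(d_in);
    cudaFree(d_out);

    for (int t = 0; ok && t < threads; ++t) {
        const int v = t % 4;
        ok = (h_out[t] == N / 4 + ((v < N % 4) ? 1 : 0));
    }
    return ok;
}
```

Only block sizes with an explicit instance can be launched, so a production version would list every size it intends to support in the `switch`.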
The CUDA SDK contains a number of good example codes which demonstrate different ways that shared memory can be used in kernels to improve memory read and write performance. The matrix transpose, reduction and 3D finite difference method examples are all good models of shared memory usage. Each also has a good paper which discusses the optimization strategies behind the shared memory use in the codes. You would be well served by studying them until you understand how and why they work.