How to share a common value between threads in a given block? - cuda

I have a kernel in which each thread runs a for loop whose number of iterations differs from block to block. I use a buffer of size N_BLOCKS to store the number of iterations required for each block, so each thread in a given block must know the iteration count specific to its block.
However, I'm not sure which way is best (performance-wise) to read the value and distribute it to all the other threads. I see only one good way (please tell me if there is something better): store the value in shared memory and have each thread read it. For example:
__global__ void foo( int* nIterBuf )
{
    __shared__ int nIter;
    if( threadIdx.x == 0 )
        nIter = nIterBuf[blockIdx.x];
    __syncthreads();
    for( int i=0; i < nIter; i++ )
        ...
}
Any other better solutions? My app will use a lot of data, so I want the best performance.
Thanks!

Read-only values that are uniform across all threads in a block are probably best stored in __constant__ arrays. On some CUDA architectures such as Fermi (SM 2.x), if you declare the array or pointer argument using the C++ const keyword AND you access it uniformly within the block (i.e. the index only depends on blockIdx, not threadIdx), then the compiler may automatically promote the reference to constant memory.
The advantage of constant memory is that it goes through a dedicated cache, so it doesn't pollute the L1; and if the amount of data you access per block is relatively small, then after the initial compulsory miss in each thread block you should always hit in the cache.
You also won't need to use any shared memory or transfer from global to shared memory.
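As a hedged illustration (not from the original answer), here is a minimal sketch of the constant-memory approach; MAX_BLOCKS and d_nIter are made-up names, and the per-block counts are assumed to fit within the 64KB constant memory space:
#define MAX_BLOCKS 1024            // hypothetical upper bound on the grid size

__constant__ int d_nIter[MAX_BLOCKS];

__global__ void foo()
{
    // Uniform access: the index depends only on blockIdx, so every thread in the
    // block reads the same value through the constant cache.
    int nIter = d_nIter[blockIdx.x];
    for (int i = 0; i < nIter; i++) {
        // ...
    }
}

// Host side, before the launch:
// cudaMemcpyToSymbol(d_nIter, h_nIter, nBlocks * sizeof(int));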

If my info is up to date, shared memory is the second-fastest memory, second only to registers.
If reading this value from shared memory on every iteration slows you down, and you still have registers available (check your GPU's compute capability and specs), you could try storing a copy of the value in every thread's register (using a local variable).
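A minimal sketch of that idea, building on the code from the question (the loop body is still elided):
__global__ void foo(int *nIterBuf)
{
    __shared__ int nIter;
    if (threadIdx.x == 0)
        nIter = nIterBuf[blockIdx.x];
    __syncthreads();

    // Copy the value into a per-thread register once, so the loop bound is not
    // re-read from shared memory on every iteration.
    int myIter = nIter;
    for (int i = 0; i < myIter; i++) {
        // ...
    }
}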

Related

The shared memory size is limited to the maximum thread number when using AtomicAdd function

I use atomic operations to calculate a summation of values, like a histogram.
So I first accumulate the values in shared memory within each block, and then the values stored in shared memory in each block are added to global memory.
The whole code is as follows.
__global__ void KERNEL_RIGID_force_sum(part1 *P1, part3 *P3, int_t *nop_sol,
                                       Real *xcm, Real *ycm, Real *zcm,
                                       Real *sum_fx, Real *sum_fy, Real *sum_fz)
{
    int_t i = threadIdx.x + blockIdx.x*blockDim.x;
    if (i >= k_num_part2) return;
    if (P1[i].i_type == 3) return;
    // if(P1[i].p_type<RIGID) return;

    // initialize accumulation arrays in shared memory
    __shared__ int_t tmp_nop[128];
    __shared__ Real tmp_xcm[128], tmp_ycm[128], tmp_zcm[128];
    __shared__ Real tmp_fx[128], tmp_fy[128], tmp_fz[128];
    tmp_nop[threadIdx.x] = 0;
    tmp_xcm[threadIdx.x] = 0;
    tmp_ycm[threadIdx.x] = 0;
    tmp_zcm[threadIdx.x] = 0;
    tmp_fx[threadIdx.x] = 0;
    tmp_fy[threadIdx.x] = 0;
    tmp_fz[threadIdx.x] = 0;
    __syncthreads();

    Real xi, yi, zi;
    Real fxi, fyi, fzi;
    int_t ptypei;
    ptypei = P1[i].p_type;
    xi = P1[i].x;
    yi = P1[i].y;
    zi = P1[i].z;
    fxi = P3[i].ftotalx;
    fyi = P3[i].ftotaly;
    fzi = P3[i].ftotalz;

    // accumulate values into shared memory, indexed by particle type
    atomicAdd(&tmp_nop[ptypei], 1);
    atomicAdd(&tmp_xcm[ptypei], xi);
    atomicAdd(&tmp_ycm[ptypei], yi);
    atomicAdd(&tmp_zcm[ptypei], zi);
    atomicAdd(&tmp_fx[ptypei], fxi);
    atomicAdd(&tmp_fy[ptypei], fyi);
    atomicAdd(&tmp_fz[ptypei], fzi);
    __syncthreads();

    // add shared memory values to global memory
    atomicAdd(&nop_sol[threadIdx.x], tmp_nop[threadIdx.x]);
    atomicAdd(&xcm[threadIdx.x], tmp_xcm[threadIdx.x]);
    atomicAdd(&ycm[threadIdx.x], tmp_ycm[threadIdx.x]);
    atomicAdd(&zcm[threadIdx.x], tmp_zcm[threadIdx.x]);
    atomicAdd(&sum_fx[threadIdx.x], tmp_fx[threadIdx.x]);
    atomicAdd(&sum_fy[threadIdx.x], tmp_fy[threadIdx.x]);
    atomicAdd(&sum_fz[threadIdx.x], tmp_fz[threadIdx.x]);
}
But there are some problems.
Because the number of threads per block is 128 in my code, I allocate the shared memory and global memory arrays with size 128.
What should I do if I want to use shared memory arrays larger than the maximum block size of 1,024 threads (i.e. when there are more than 1,024 p_type values)?
If I allocate the shared memory arrays with size 1,024 or larger, the compiler says
ptxas error : Entry function '_Z29KERNEL_RIGID_force_sum_sharedP17particles_array_1P17particles_array_3PiPdS4_S4_S4_S4_S4_' uses too much shared data (0xd000 bytes, 0xc000 max)
To put it simply, I don't know what to do when the range over which the reduction is performed is more than 1,024.
Is it possible to index using anything other than threadIdx.x?
Could you give me some advice?
Shared memory is limited in size. The default limit for most GPUs is 48KB. It has no direct connection to the number of threads in the threadblock. Some GPUs can go as high as 96KB, but you haven't indicated which GPU you are running on. The error you are getting is not directly related to the number of threads per block, but to the amount of shared memory you are requesting per block.
If the amount of shared memory you need exceeds the shared memory available, you'll need to come up with another algorithm. For example, a shared memory reduction using atomics (what you seem to have here) could be converted into an equivalent operation using global atomics.
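A hedged sketch (not from the original answer) of that global-atomics variant, reusing the question's names and assuming Real and int_t are types for which atomicAdd overloads exist on the target device (e.g. float/int, or double on cc6.0+):
__global__ void KERNEL_RIGID_force_sum_global(part1 *P1, part3 *P3, int_t *nop_sol,
                                              Real *xcm, Real *ycm, Real *zcm,
                                              Real *sum_fx, Real *sum_fy, Real *sum_fz)
{
    int_t i = threadIdx.x + blockIdx.x*blockDim.x;
    if (i >= k_num_part2) return;
    if (P1[i].i_type == 3) return;

    // Accumulate straight into the global output arrays, indexed by particle type.
    // The arrays only need one slot per p_type value, so the type count is no longer
    // limited by the shared memory size or the block size.
    int_t ptypei = P1[i].p_type;
    atomicAdd(&nop_sol[ptypei], 1);
    atomicAdd(&xcm[ptypei], P1[i].x);
    atomicAdd(&ycm[ptypei], P1[i].y);
    atomicAdd(&zcm[ptypei], P1[i].z);
    atomicAdd(&sum_fx[ptypei], P3[i].ftotalx);
    atomicAdd(&sum_fy[ptypei], P3[i].ftotaly);
    atomicAdd(&sum_fz[ptypei], P3[i].ftotalz);
}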
Another approach would be to determine if it is possible to reduce the size of the array elements you are using. I have no idea what your types (Real, int_t) correspond to, but depending on the types, you may be able to get larger array sizes by converting to 16-bit types. cc7.x or higher devices can do atomic add operations on 16-bit floating point, for example, and with a bit of effort you can even do atomics on 8-bit integers.

Are needless write operations in multi-thread kernels in CUDA inefficient?

I have a kernel in my CUDA code where I want a bunch of threads to do a bunch of computations on some piece of shared memory (because it's much faster than doing so on global memory), and then write the result to global memory (so I can use it in later kernels). The kernel looks something like this:
__global__ void calc(float * globalmem)
{
    __shared__ float sharemem; //declare shared memory
    sharemem = 0; //set it to initial value
    __syncthreads();
    //do various calculations on the shared memory
    //for example I use atomicAdd() to add each thread's
    //result to sharemem...
    __syncthreads();
    *globalmem = sharemem; //write shared memory to global memory
}
The fact that every single thread is writing the data out from shared to global memory, when I really only need to write it out once, feels fishy to me. I also get the same feeling from the fact that every thread initializes the shared memory to zero at the start of the code. Is there a faster way to do this than my current implementation?
At the warp level, there's probably not much performance difference between doing a redundant read or write vs. having a single thread do it.
However I would expect a possibly measurable performance difference by having multiple warps in a threadblock do the redundant read or write (vs. a single thread).
It should be sufficient to address these concerns by having a single thread do the read or write, rather than redundantly:
__global__ void calc(float * globalmem)
{
    __shared__ float sharemem; //declare shared memory
    if (!threadIdx.x) sharemem = 0; //set it to initial value
    __syncthreads();
    //do various calculations on the shared memory
    //for example I use atomicAdd() to add each thread's
    //result to sharemem...
    __syncthreads();
    if (!threadIdx.x) *globalmem = sharemem; //write shared memory to global memory
}
Although you didn't ask about it, using atomics within a threadblock on shared memory may possibly be replaceable (for possibly better performance) by a shared memory reduction method.
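For reference, here is a minimal sketch (not part of the original answer) of such a shared-memory reduction, assuming a power-of-two block size of 256 threads; myval stands in for whatever each thread computes:
__global__ void calc(float *globalmem)
{
    __shared__ float sdata[256];        // assumes blockDim.x == 256
    float myval = 1.0f;                 // placeholder for this thread's result
    sdata[threadIdx.x] = myval;
    __syncthreads();

    // Tree reduction: halve the number of active threads at each step.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    // Thread 0 holds the block-wide sum; a single write to global memory.
    if (threadIdx.x == 0)
        *globalmem = sdata[0];
}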

Very large instruction replay overhead for random memory access on Kepler

I am studying the performance of random memory access on a Kepler GPU, a K40m. The kernel I use is pretty simple, as follows:
__global__ void scatter(int *in1, int *out1, int *loc, const size_t n) {
    int globalSize = gridDim.x * blockDim.x;
    int globalId = blockDim.x * blockIdx.x + threadIdx.x;
    for (unsigned int i = globalId; i < n; i += globalSize) {
        int pos = loc[i];
        out1[pos] = in1[i];
    }
}
That is, I will read an array in1 as well as a location array loc. Then I permute in1 according to loc and output to the array out1. Generally, out1[loc[i]] = in1[i]. Note that the location array is sufficiently shuffled and each element is unique.
I just use the default nvcc compilation settings with the -O3 flag enabled. The L1 dcache is disabled. I also fix the number of blocks at 8192 and the block size at 1024.
I use nvprof to profile my program. It is easy to see that most of the instructions in the kernel are memory accesses. For a warp's memory instruction, since each thread demands a separate 4-byte item, the instruction should be replayed multiple times (at most 31 times?) and issue multiple memory transactions to serve all the threads within the warp. However, the metric "inst_replay_overhead" seems confusing: when the number of tuples n = 16M, the replay overhead is 13.97, which makes sense to me. But when n = 600M, the replay overhead becomes 34.68. Even for larger data, say 700M and 800M, the replay overhead reaches 85.38 and 126.87.
The meaning of "inst_replay_overhead", according to the documentation, is "Average number of replays for each instruction executed". Does that mean that when n = 800M, each executed instruction has on average been replayed 127 times? How come the replay count is much larger than 31 here? Am I misunderstanding something, or am I missing other factors that also contribute greatly to the replay count? Thanks a lot!
You may be misunderstanding the fundamental meaning of an instruction replay.
inst_replay_overhead includes the number of times an instruction was issued, but wasn't able to be completed. This can occur for various reasons, which are explained in this answer. Pertinent excerpt from the answer:
If the SM is not able to complete the issued instruction due to
a constant cache miss on an immediate constant (constant referenced in the instruction),
address divergence in an indexed constant load,
address divergence in a global/local memory load or store,
a bank conflict in a shared memory load or store,
an address conflict in an atomic or reduction operation,
a load or store operation requiring data to be written to the load/store unit or read from a unit exceeding the read/write bus width (e.g. a 128-bit load or store), or
a load cache miss (a replay occurs to fetch the data when the data is ready in the cache),
then the SM scheduler has to issue the instruction multiple times. This is called an instruction replay.
I'm guessing this happens because of scattered reads in your case. This concept of instruction replay also exists on the CPU side of things. Wikipedia article here.

How can I make sure the compiler parallelizes my loads from global memory?

I've written a CUDA kernel that looks something like this:
int tIdx = threadIdx.x; // Assume a 1-D thread block and a 1-D grid
int buffNo = 0;
for (int offset = buffSz*blockIdx.x; offset < totalCount; offset += buffSz*gridDim.x) {
    // Select which "page" we're using on this iteration
    float *buff = &sharedMem[buffNo*buffSz];
    // Load data from global memory
    if (tIdx < nLoadThreads) {
        for (int ii = tIdx; ii < buffSz; ii += nLoadThreads)
            buff[ii] = globalMem[ii+offset];
    }
    // Wait for shared memory
    __syncthreads();
    // Perform computation
    if (tIdx >= nLoadThreads) {
        // Perform some computation on the contents of buff[]
    }
    // Switch pages
    buffNo ^= 0x01;
}
Note that there's only one __syncthreads() in the loop, so the first nLoadThreads threads will start loading the data for the 2nd iteration while the rest of the threads are still computing the results for the 1st iteration.
I was thinking about how many threads to allocate for loading vs. computing, and I reasoned that I would only need a single warp for loading, regardless of buffer size, because that inner for loop consists of independent loads from global memory: they can all be in flight at the same time. Is this a valid line of reasoning?
And yet when I try this out, I find that (1) increasing the # of load warps dramatically increases performance, and (2) the disassembly in nvvp shows that buff[ii] = globalMem[ii+offset] was compiled into a load from global memory followed 2 instructions later by a store to shared memory, indicating that the compiler is not applying instruction-level parallelism here.
Would additional qualifiers (const, __restrict__, etc) on buff or globalMem help ensure the compiler does what I want?
I suspect the problem has to do with the fact that buffSz is not known at compile-time (the actual data is 2-D and the appropriate buffer size depends on the matrix dimensions). In order to do what I want, the compiler will need to allocate a separate register for each LD operation in flight, right? If I manually unroll the loop, the compiler re-orders the instructions so that there are a few LD in flight before the corresponding ST needs to access that register. I tried a #pragma unroll but the compiler only unrolled the loop without reordering the instructions, so that didn't help. What else can I do?
The compiler has no chance to reorder stores to shared memory away from loads from global memory, because a __syncthreads() barrier is immediately following.
As all of the threads have to wait at the barrier anyway, it is faster to use more threads for loading. This means that more global memory transactions can be in flight at any time, and each load thread incurs the global memory latency less often.
All CUDA devices so far do not support out-of-order execution, so the load loop will incur exactly one global memory latency per loop iteration, unless the compiler can unroll it and reorder loads before stores.
To allow full unrolling, the number of loop iterations needs to be known at compile time. You can use talonmies' suggestion of templating the loop trips to achieve this.
You can also use partial unrolling. Annotating the load loop with #pragma unroll 2 will allow the compiler to issue two loads, then two stores, for every two loop iterations, thus achieving a similar effect to doubling nLoadThreads. Replacing 2 with a higher number is possible, but you will hit the maximum number of transactions in flight at some point (use float2 or float4 moves to transfer more data with the same number of transactions). Also, it is difficult to predict whether the compiler will prefer reordering instructions over the cost of more complex code for the final, potentially partial, trip through the unrolled loop.
So the suggestions are:
Use as many load threads as possible.
Unroll the load loop by templating the number of loop iterations and instantiating it for all possible numbers of loop trips (or the most common ones, with a generic fallback), or by using partial loop unrolling (see the sketch below).
If the data is suitably aligned, move it as float2 or float4 to move more data with the same number of transactions.
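As a hedged illustration of the templating suggestion above (the helper load_buffer and its NITER parameter are made-up names, and buffSz is assumed to be a multiple of nLoadThreads):
// The trip count is a compile-time template parameter, so the compiler can fully
// unroll the loop and schedule all global loads before the corresponding shared stores.
template <int NITER>
__device__ void load_buffer(float *buff, const float * __restrict__ globalMem,
                            int offset, int tIdx, int nLoadThreads)
{
#pragma unroll
    for (int ii = 0; ii < NITER; ii++)
        buff[tIdx + ii*nLoadThreads] = globalMem[offset + tIdx + ii*nLoadThreads];
}

// Call site: dispatch on the runtime buffer size.
// switch (buffSz / nLoadThreads) {
//     case 4:  load_buffer<4>(buff, globalMem, offset, tIdx, nLoadThreads); break;
//     case 8:  load_buffer<8>(buff, globalMem, offset, tIdx, nLoadThreads); break;
//     default: /* generic fallback with a runtime loop */ break;
// }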

CUDA: streaming the same memory location to all threads

Here's my problem: I have quite a big set of doubles (an array of 77,500 doubles) to be stored somewhere in CUDA. Now, I need a big set of threads to sequentially do a bunch of operations on that array. Every thread will have to read the SAME element of that array, perform tasks, store results in shared memory and then read the next element of the array. Note that every thread will simultaneously have to read (just read) from the same memory location. So I wonder: is there any way to broadcast the same double to all threads with just one memory read? Reading many times would be quite useless... Any ideas?
This is a common optimization. The idea is to make each thread cooperate with its blockmates to read in the data:
// choose some reasonable block size
const unsigned int block_size = 256;

__global__ void kernel(double *ptr)
{
    __shared__ double window[block_size];

    // cooperate with my block to load block_size elements
    window[threadIdx.x] = ptr[threadIdx.x];

    // wait until the window is full
    __syncthreads();

    // operate on the data
    ...
}
You can iteratively "slide" the window across the array block_size (or maybe some integer factor more) elements at a time to consume the whole thing. The same technique applies when you'd like to store the data back in a synchronized fashion.
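A minimal sketch (not from the original answer) of that sliding-window loop, assuming the array length n is passed in and, for brevity, is a multiple of block_size:
// choose some reasonable block size
const unsigned int block_size = 256;

__global__ void kernel(const double *ptr, unsigned int n)
{
    __shared__ double window[block_size];

    // slide the window over the array, block_size elements at a time
    for (unsigned int base = 0; base < n; base += block_size) {
        // cooperate with my block to load block_size elements
        window[threadIdx.x] = ptr[base + threadIdx.x];

        // wait until the window is full
        __syncthreads();

        // operate on the data: every thread can now read any window[j]
        // ...

        // make sure everyone is done before the window is overwritten
        __syncthreads();
    }
}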