__threadfence() and L1 cache coherence - cuda

It is my understanding (see e.g. How can I enforce CUDA global memory coherence without declaring pointer as volatile?, CUDA block synchronization differences between GTS 250 and Fermi devices and this post in the nvidia Developer Zone) that __threadfence() guarantees that a global writes will be visible to other threads before the thread continues. However, another thread could still read a stale value from its L1 cache even after the __threadfence() has returned.
That is:
Thread A writes some data to global memory, then calls __threadfence(). Then, at some time after __threadfence() has returned, and the writes are visible to all other threads, Thread B is asked to read from this memory location. It finds it has the data in L1, so loads that. Unfortunately for the developer, the data in Thread B's L1 is stale (i.e. it is as before Thread A updated this data).
First of all: is this correct?
Supposing it is, then it seems to me that __threadfence() is only useful if either one can be certain that data will not be in L1 (somewhat unlikely?) or if e.g. the read always bypasses L1 (e.g. volatile or atomics). Is this correct?
I ask because I have a relatively simple use-case - propagating data up a binary tree - using atomically-set flags and __threadfence(): the first thread to reach a node exits, and the second writes data to it based on its two children (e.g. the minimum of their data). This works for most nodes, but usually fails for at least one. Declaring the data volatile gives consistently correct results, but induces a performance hit for the 99%+ of cases where no stale value is grabbed from L1. I want to be sure this is the only solution for this algorithm. A simplified example is given below. Note that the node array is ordered breadth-first, with the leaves beginning at index start and already populated with data.
__global__ void propagate_data(volatile Node *nodes,
const unsigned int n_nodes,
const unsigned int start,
unsigned int* flags)
{
int tid, index, left, right;
float data;
bool first_arrival;
tid = start + threadIdx.x + blockIdx.x*blockDim.x;
while (tid < n_nodes)
{
// We start at a node with a full data section; modify its flag
// accordingly.
flags[tid] = 2;
// Immediately move up the tree.
index = nodes[tid].parent;
first_arrival = (atomicAdd(&flags[index], 1) == 0);
// If we are the second thread to reach this node then process it.
while (!first_arrival)
{
left = nodes[index].left;
right = nodes[index].right;
// If Node* nodes is not declared volatile, this occasionally
// reads a stale value from L1.
data = min(nodes[left].data, nodes[right].data);
nodes[index].data = data;
if (index == 0) {
// Root node processed, so all nodes processed.
return;
}
// Ensure above global write is visible to all device threads
// before setting flag for the parent.
__threadfence();
index = nodes[index].parent;
first_arrival = (atomicAdd(&flags[index], 1) == 0);
}
tid += blockDim.x*gridDim.x;
}
return;
}

First of all: is this correct?
Yes, __threadfence() pushes data into L2 and out to global memory. It has no effect on the L1 caches in other SMs.
Is this correct?
Yes, if you combine __threadfence() with volatile for global memory accesses, you should have confidence that values will eventually be visible to other threadblocks. Note, however that synchronization between threadblocks is not a well-defined concept in CUDA. There are no explicit mechanisms to do so and no guarantee of the order of threadblock execution, so just because you have code that has a __threadfence() somewhere operating on a volatile item, still does not really guarantee what data another threadblock may pick up. That is also dependent on the order of execution.
If you use volatile, the L1 (if enabled -- current Kepler devices don't really have L1 enabled for general global access) should be bypassed. If you don't use volatile, then the L1 for the SM that is currently executing the __threadfence() operation should be consistent/coherent with L2 (and global) at the completion of the __threadfence() operation.
Note that the L2 cache is unified across the device and is therefore always "coherent". For your use case, at least from the device code perspective, there is no difference between L2 and global memory, regardless of which SM you are on.
And, as you indicate, (global) atomics always operate on L2/global memory.

Related

How can I enforce ordering between writes and reads to global memory?

I have a CUDA kernel of following form:
Void launch_kernel(..Arguments...)
{
int i = threadIdx.x
//Load required data
int temp1 = A[i];
int temp2 = A[i+1];
int temp3= A[i+2];
// compute step
int output1 = temp1 + temp2 + temp3;
int output2 = temp1 + temp3;
// Store the result
B[i] = output1;
C[i] = output2;
}
As discussed in CUDA manual, the consistency model for GPU global memory is not sequential. As a result, the memory operations may appear to be performed in order different than original program order. To enforce memory ordering CUDA offers __threadfence() functions. However, as per the manual, such function enforces relative ordering across reads and relative ordering across writes. Quoting a line from manual:
All writes to shared and global memory made by the calling thread before the call to __threadfence_block() are observed by all threads in the block of the calling thread as occurring before all writes to shared memory and global memory made by the calling thread after the call to __threadfence_block();
So it is clear that __threadfence() is insufficient to enforce ordering among reads and writes.
How do I enforce the ordering across reads and writes to global memory. Alternatively, how do I make sure that all the reads are guaranteed to be completed before executing the compute and store section of above kernel.
Like #RobertCrovella said in his comment, your code will work fine as it is.
temp1, temp2, and temp3 are local (which will use either registers or local memory {per thread global memory}). These aren't shared between threads, so there's no concurrency concerns whatsoever. They will work just like regular C/C++.
A, B, and C are global. These will be subject to synchronization concerns. A is used as read only so access order doesn't matter. B and C are written, but each thread only writes to it's own index so the order they are written doesn't matter. Your concern about guaranteeing global memory reads are finished is unnecessary. Within a thread, your code will execute in the order written with appropriate stalls for global memory access. You wouldn't want to for performance reasons, but you can do things like B[i] = 0; B[i] = 5; temp1 = B[i]; and have temp1 guaranteed to be 5.
You don't use shared memory in this example, however it is local to thread blocks, and you can synchronize within the thread block using __syncthreads();
Synchronization of global memory across different thread blocks requires ending one kernel and beginning another. NVidia claims they are working on a better way in one of their future directions videos on youtube.

Very large instruction replay overhead for random memory access on Kepler

I am studying the performance of random memory access on a Kepler GPU, K40m. The kernel I use is pretty simple as follows,
__global__ void scatter(int *in1, int *out1, int * loc, const size_t n) {
int globalSize = gridDim.x * blockDim.x;
int globalId = blockDim.x * blockIdx.x + threadIdx.x;
for (unsigned int i = globalId; i < n; i += globalSize) {
int pos = loc[i];
out1[pos] = in1[i];
}
}
That is, I will read an array in1 as well as a location array loc. Then I permute in1 according to loc and output to the array out1. Generally, out1[loc[i]] = in1[i]. Note that the location array is sufficiently shuffled and each element is unique.
And I just use the default nvcc compilation setting with -O3 flag opened. The L1 dcache is disabled. Also I fix my # blocks to be 8192 and block size of 1024.
I use nvprof to profile my program. It is easy to know that most of the instructions in the kernel should be memory access. For an instruction of a warp, since each thread demands a discrete 4 Byte data, the instruction should be replayed multiple times (at most 31 times?) and issue multiple memory transactions to fulfill the need of all the threads within the warp. However, the metric "inst_replay_overhead" seems to be confusing: when # tuples n = 16M, the replay overhead is 13.97, which makes sense to me. But when n = 600M, the replay overhead becomes 34.68. Even for larger data, say 700M and 800M, the replay overhead will reach 85.38 and 126.87.
The meaning of "inst_replay_overhead", according to document, is "Average number of replays for each instruction executed". Is that mean when n = 800M, on average each instruction executed has been replayed 127 times? How comes the replay time much larger than 31 here? Am I misunderstanding something or am I missing other factors that will also contribute greatly to the replay times? Thanks a lot!
You may be misunderstanding the fundamental meaning of an instruction replay.
inst_replay_overhead includes the number of times an instruction was issued, but wasn't able to be completed. This can occur for various reasons, which are explained in this answer. Pertinent excerpt from the answer:
If the SM is not able to complete the issued instruction due to
constant cache miss on immediate constant (constant referenced in the instruction),
address divergence in an indexed constant load,
address divergence in a global/local memory load or store,
bank conflict in a
shared memory load or store,
address conflict in an atomic or
reduction operation,
load or store operation require data to be
written to the load store unit or read from a unit exceeding the
read/write bus width (e.g. 128-bit load or store), or
load cache miss
(replay occurs to fetch data when the data is ready in the cache)
then
the SM scheduler has to issue the instruction multiple times. This is
called an instruction replay.
I'm guessing this happens because of scattered reads in your case. This concept of instruction replay also exists on the CPU side of things. Wikipedia article here.

How can I make sure the compiler parallelizes my loads from global memory?

I've written a CUDA kernel that looks something like this:
int tIdx = threadIdx.x; // Assume a 1-D thread block and a 1-D grid
int buffNo = 0;
for (int offset=buffSz*blockIdx.x; offset<totalCount; offset+=buffSz*gridDim.x) {
// Select which "page" we're using on this iteration
float *buff = &sharedMem[buffNo*buffSz];
// Load data from global memory
if (tIdx < nLoadThreads) {
for (int ii=tIdx; ii<buffSz; ii+=nLoadThreads)
buff[ii] = globalMem[ii+offset];
}
// Wait for shared memory
__syncthreads();
// Perform computation
if (tIdx >= nLoadThreads) {
// Perform some computation on the contents of buff[]
}
// Switch pages
buffNo ^= 0x01;
}
Note that there's only one __syncthreads() in the loop, so the first nLoadThreads threads will start loading the data for the 2nd iteration while the rest of the threads are still computing the results for the 1st iteration.
I was thinking about how many threads to allocate for loading vs. computing, and I reasoned that I would only need a single warp for loading, regardless of buffer size, because that inner for loop consists of independent loads from global memory: they can all be in flight at the same time. Is this a valid line of reasoning?
And yet when I try this out, I find that (1) increasing the # of load warps dramatically increases performance, and (2) the disassembly in nvvp shows that buff[ii] = globalMem[ii+offset] was compiled into a load from global memory followed 2 instructions later by a store to shared memory, indicating that the compiler is not applying instruction-level parallelism here.
Would additional qualifiers (const, __restrict__, etc) on buff or globalMem help ensure the compiler does what I want?
I suspect the problem has to do with the fact that buffSz is not known at compile-time (the actual data is 2-D and the appropriate buffer size depends on the matrix dimensions). In order to do what I want, the compiler will need to allocate a separate register for each LD operation in flight, right? If I manually unroll the loop, the compiler re-orders the instructions so that there are a few LD in flight before the corresponding ST needs to access that register. I tried a #pragma unroll but the compiler only unrolled the loop without reordering the instructions, so that didn't help. What else can I do?
The compiler has no chance to reorder stores to shared memory away from loads from global memory, because a __syncthreads() barrier is immediately following.
As all off the threads have to wait at the barrier anyway, it is faster to use more threads for loading. This means that more global memory transactions can be in flight at any time, and each load thread has to incur global memory latency less often.
All CUDA devices so far do not support out-of-order execution, so the load loop will incur exactly one global memory latency per loop iteration, unless the compiler can unroll it and reorder loads before stores.
To allow full unrolling, the number of loop iterations needs to be known at compile time. You can use talonmies' suggestion of templating the loop trips to achieve this.
You can also use partial unrolling. Annotating the load loop with #pragma unroll 2 will allow the compiler to issue two loads, then two stores for every two loop iterations, thus achieve a similar effect to doubling nLoadThreads. Replacing 2 with higher numbers is possible, but you will hit the maximum number of transactions in flight at some point (use float2 or float4 moves to transfer more data with the same number of transactions). Also it is difficult to predict whether the compiler will prefer reordering instructions over the cost of more complex code for the final, potentially partial, trip through the unrolled loop.
So the suggestions are:
Use as many load threads as possible.
Unroll the load loop by templating the number of loop iterations and instantiating it for all possible number of loop trips (or the most common ones, with a generic fallback), or by using partial loop unrolling.
If the data is suitably aligned, move it as float2 or float4 to move more data with the same number of transactions.

Synchronization in CUDA

I read cuda reference manual for about synchronization in cuda but i don't know it clearly. for example why we use cudaDeviceSynchronize() or __syncthreads()? if don't use them what happens, program can't work correctly? what difference between cudaMemcpy and cudaMemcpyAsync in action? can you show an example that show this difference?
cudaDeviceSynchronize() is used in host code (i.e. running on the CPU) when it is desired that CPU activity wait on the completion of any pending GPU activity. In many cases it's not necessary to do this explicitly, as GPU operations issued to a single stream are automatically serialized, and certain other operations like cudaMemcpy() have an inherent blocking device synchronization built into them. But for some other purposes, such as debugging code, it may be convenient to force the device to finish any outstanding activity.
__syncthreads() is used in device code (i.e. running on the GPU) and may not be necessary at all in code that has independent parallel operations (such as adding two vectors together, element-by-element). However, one example where it is commonly used is in algorithms that will operate out of shared memory. In these cases it's frequently necessary to load values from global memory into shared memory, and we want each thread in the threadblock to have an opportunity to load it's appropriate shared memory location(s), before any actual processing occurs. In this case we want to use __syncthreads() before the processing occurs, to ensure that shared memory is fully populated. This is just one example. __syncthreads() might be used any time synchronization within a block of threads is desired. It does not allow for synchronization between blocks.
The difference between cudaMemcpy and cudaMemcpyAsync is that the non-async version of the call can only be issued to stream 0 and will block the calling CPU thread until the copy is complete. The async version can optionally take a stream parameter, and returns control to the calling thread immediately, before the copy is complete. The async version typically finds usage in situations where we want to have asynchronous concurrent execution.
If you have basic questions about CUDA programming, it's recommended that you take some of the webinars available.
Moreover, __syncthreads() becomes really necessary when you have some conditional paths in your code, and then you want to run an operation that depends on several array element.
Consider the following example:
int n = threadIdx.x;
if( myarray[n] > 0 )
{
myarray[n] = - myarray[n];
}
double y = myarray[n] + myarray[n+1]; // Not all threads reaches here at the same time
In the above example, not all threads will have the same execution sequence. Some threads will take longer based on the if condition. When considering the last line of the example, you need to make sure that all the threads had exactly finished the if-condition and updated myarray correctly. If this wasn't the case, y may use some updated and non-updated values.
In this case, it becomes a must to add __syncthreads() before evaluating y to overcome this problem:
if( myarray[n] > 0 )
{
myarray[n] = - myarray[n];
}
__syncthreads(); // All threads will wait till they come to this point
// We are now quite confident that all array values are updated.
double y = myarray[n] + myarray[n+1];

How to share a common value between threads in a given block?

I have a kernel that, for each thread in a given block, computes a for loop with a different number of iterations. I use a buffer of size N_BLOCKS to store the number of iterations required for each block. Hence, each thread in a given block must know the number of iterations specific to its block.
However, I'm not sure which way is the best (performance speaking) to read the value and distribute it to all the other threads. I see only one good way (please tell me if there is something better): store the value in shared memory and have each thread read it. For example:
__global__ void foo( int* nIterBuf )
{
__shared__ int nIter;
if( threadIdx.x == 0 )
nIter = nIterBuf[blockIdx.x];
__syncthreads();
for( int i=0; i < nIter; i++ )
...
}
Any other better solutions? My app will use a lot of data, so I want the best performance.
Thanks!
Read-only values that are uniform across all threads in a block are probably best stored in __constant__ arrays. On some CUDA architectures such as Fermi (SM 2.x), if you declare the array or pointer argument using the C++ const keyword AND you access it uniformly within the block (i.e. the index only depends on blockIdx, not threadIdx), then the compiler may automatically promote the reference to constant memory.
The advantage of constant memory is that it goes through a dedicated cache, so it doesn't pollute the L1, and if the amount of data you are accessing per block is relatively small, after the first access within each block, you should always hit in the cache after the initial compulsory miss in each thread block.
You also won't need to use any shared memory or transfer from global to shared memory.
If my info is up-to-date, the shared memory is the second fastest memory, second only to the registers.
If reading this data from shared memory every iteration slows you down and you still have registers available (refer to your GPU's compute capability and specs), you could perhaps try to store a copy of this value in every thread's register (using a local variable).