In the CUDA Programming Guide in the section about Cooperative Groups, there is an example of grid-local synchronization:
grid_group grid = this_grid();
grid.sync();
Unfortunately, I couldn't find a precise definition of the behavior of grid.sync(). Is it correct to take the following definition given for __syncthreads() and extend it to the grid level?
void __syncthreads();
waits until all threads in the thread block have
reached this point and all global and shared memory accesses made by
these threads prior to __syncthreads() are visible to all threads in
the block.
So, my question is: is the following correct?
this_grid().sync();
waits until all threads in the grid have
reached this point and all global and shared memory accesses made by
these threads prior to this_grid().sync() are visible to all threads in
the grid.
I doubt the correctness of this because in the CUDA Programming Guide, a couple of lines below grid.sync(), there is the following statement:
To guarantee the co-residency of the thread blocks on the GPU, the number of blocks launched needs to be carefully considered.
Does this mean that if I launch so many threads that there is no co-residency of thread blocks, I can end up in a situation where threads deadlock?
The same question arises when I try to use coalesced_threads().sync(). Is the following correct?
coalesced_threads().sync();
waits until all active threads in the warp have
reached this point and all global and shared memory accesses made by
these threads prior to coalesced_threads().sync() are visible to all threads in
the set of active threads of the warp.
Does the following example exit the while loop?
auto ct = coalesced_threads();
assert(ct.size() == 2);
b = 0; // shared between all threads
if (ct.thread_rank() == 0)
    while (b == 0) {
        // what if only rank 0 thread is always taken due to thread divergence?
        ct.sync(); // does it guarantee that rank 0 will wait for rank 1?
    }
if (ct.thread_rank() == 1)
    while (b == 0) {
        // what if a thread with rank 1 never executed?
        b = 1;
        ct.sync(); // does it guarantee that rank 0 will wait for rank 1?
    }
To make the example above clearer: without ct.sync() it is unsafe and can deadlock (loop infinitely):
auto ct = coalesced_threads();
assert(ct.size() == 2);
b = 0; // shared between all threads
if (ct.thread_rank() == 0)
    while (b == 0) {
        // what if only rank 0 thread is always taken due to thread divergence?
    }
if (ct.thread_rank() == 1)
    while (b == 0) {
        // what if a thread with rank 1 never executed?
        b = 1;
    }
So, my question is: is the following correct?
this_grid().sync();
waits until all threads in the grid have reached this point and all global and shared memory accesses made by these threads prior to this_grid().sync() are visible to all threads in the grid.
Yes, that is correct, assuming you have a proper cooperative launch. A proper cooperative launch implies a number of things (a sketch tying them together follows the list):
the cooperative launch property is true on the GPU you are running on
you have launched using a properly formed cooperative launch
you have met grid sizing requirements for a cooperative launch
after the cooperative launch, cudaGetLastError() returns cudaSuccess
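A minimal sketch (my own illustration, not from the Programming Guide) of what a launch meeting those requirements might look like; the kernel name coop_kernel, the block size, and the error check are assumptions for the example:
#include <cooperative_groups.h>
#include <cstdio>
namespace cg = cooperative_groups;

// Grid synchronization typically also requires compiling with -rdc=true.
__global__ void coop_kernel(int *data)
{
    cg::grid_group grid = cg::this_grid();
    // ... phase 1 work on data ...
    grid.sync();   // grid-wide barrier: all blocks must be co-resident
    // ... phase 2 work that depends on phase 1 ...
}

int main()
{
    int dev = 0, supportsCoopLaunch = 0;
    // 1. the cooperative launch property is true on the GPU you are running on
    cudaDeviceGetAttribute(&supportsCoopLaunch, cudaDevAttrCooperativeLaunch, dev);
    if (!supportsCoopLaunch) return 1;

    // 3. grid sizing: launch no more blocks than can be simultaneously resident
    int numBlocksPerSm = 0, numSms = 0, blockSize = 256;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&numBlocksPerSm, coop_kernel, blockSize, 0);
    cudaDeviceGetAttribute(&numSms, cudaDevAttrMultiProcessorCount, dev);

    int *data = nullptr;
    cudaMalloc(&data, numBlocksPerSm * numSms * blockSize * sizeof(int));
    void *args[] = { &data };

    // 2. a properly formed cooperative launch (not the <<<...>>> syntax)
    cudaLaunchCooperativeKernel((void *)coop_kernel, dim3(numBlocksPerSm * numSms),
                                dim3(blockSize), args, 0, 0);

    // 4. after the cooperative launch, cudaGetLastError() returns cudaSuccess
    printf("launch: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaDeviceSynchronize();
    return 0;
}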
Does this mean that if I launch so many threads that there is no co-residency of thread blocks ...
If you violate the requirements for a cooperative launch, you are exploring undefined behavior. There is no point trying to definitively answer such questions, except to say that the behavior is undefined.
Regarding your statements about coalesced threads, they are correct, although the wording must be understood carefully. The set of active threads for a particular instruction is the same as the set of coalesced threads.
In your example, you are creating an illegal case:
auto ct = coalesced_threads();
assert(ct.size() == 2); // there are exactly 2 threads in group ct
b = 0; // shared between all threads
if (ct.thread_rank() == 0) // only the thread whose rank is zero participates in what follows - by definition you have excluded 1 thread
    while (b == 0) {
        // what if only rank 0 thread is always taken due to thread divergence?
        // it is illegal to request a synchronization of a group of threads when
        // your conditional code prevents one or more threads in the group from
        // participating
        ct.sync(); // does it guarantee that rank 0 will wait for rank 1?
    }
Two different .sync() statements, in different places in the code, cannot satisfy the requirements of a single sync barrier. They each represent an individual barrier, whose requirements must be properly met.
Due to the illegal coding, this example also has undefined behavior; the same comments apply.
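For contrast, here is a minimal sketch of a legal pattern (my own illustration, not part of the original answer), in which every thread of the coalesced group reaches the same barrier; b is assumed to be a __shared__ int as in the question, touched only by these two threads:
auto ct = coalesced_threads();
assert(ct.size() == 2);
if (ct.thread_rank() == 1)
    b = 1;        // only rank 1 writes ...
ct.sync();        // ... but both threads of ct reach this one barrier,
                  // which also provides memory ordering within the group
assert(b == 1);   // after the barrier, both threads observe the write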
I know that shared memory is not automatically synchronized between the threads of a block, but I don't know whether a shared memory write is immediately visible to the thread that wrote it.
For example, in this example:
__global__ void kernel()
{
    __shared__ int i, j;
    if (threadIdx.x == 0)
    {
        i = 10;
        j = i;
    }
    // #1
}
Is it guaranteed at #1 that, for thread 0, i == 10 and j == 10, or do I need some memory fence or a local variable?
I'm going to assume that by
for thread 0
you mean, "the thread that passed the if-test". And for the sake of this discussion, I will assume there is only one of those.
Yes, it's guaranteed. Otherwise basic C++ compliance would be broken in CUDA.
Challenges in CUDA may arise in inter-thread communication or behavior. However, you don't have that in view in your question.
As an example, it is certainly not guaranteed that for some other thread, i will be visible as 10, without some sort of fence or barrier.
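To illustrate that last point, a minimal sketch (my own addition, assuming the other threads of the block need to read i): making the write visible to the rest of the block requires a barrier such as __syncthreads().
__global__ void kernel()
{
    __shared__ int i, j;
    if (threadIdx.x == 0)
    {
        i = 10;  // visible to thread 0 immediately: ordinary C++ semantics
        j = i;   // so j == 10 for thread 0 is guaranteed at #1
    }
    __syncthreads();  // barrier plus memory ordering for the whole block
    // only after this point may the other threads safely read i == 10 and j == 10
}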
Quoting from the Independent Thread Scheduling section (page 27) of the Volta whitepaper:
Note that execution is still SIMT: at any given clock cycle, CUDA cores execute the same instruction for all active threads in a warp just as before, retaining the execution efficiency of previous architectures.
From my understanding, this implies that if there is no divergence within threads of a warp,
(i.e. all threads of a warp are active), the threads should execute in lockstep.
Now, consider listing 8 from this blog post, reproduced below:
unsigned tid = threadIdx.x;
int v = 0;
v += shmem[tid+16]; __syncwarp(); // 1
shmem[tid] = v; __syncwarp(); // 2
v += shmem[tid+8]; __syncwarp(); // 3
shmem[tid] = v; __syncwarp(); // 4
v += shmem[tid+4]; __syncwarp(); // 5
shmem[tid] = v; __syncwarp(); // 6
v += shmem[tid+2]; __syncwarp(); // 7
shmem[tid] = v; __syncwarp(); // 8
v += shmem[tid+1]; __syncwarp(); // 9
shmem[tid] = v;
Since we don't have any divergence here, I would expect the threads to already be executing in lockstep without
any of the __syncwarp() calls.
This seems to contradict the statement I quote above.
I would appreciate it if someone could clarify this confusion.
From my understanding, this implies that if there is no divergence within threads of a warp, (i.e. all threads of a warp are active), the threads should execute in lockstep.
If all threads in a warp are active for a particular instruction, then by definition there is no divergence. This has been true since day 1 in CUDA. It's not logical in my view to connect your statement with the one you excerpted, because it is a different case:
Note that execution is still SIMT: at any given clock cycle, CUDA cores execute the same instruction for all active threads in a warp just as before, retaining the execution efficiency of previous architectures
This indicates that the active threads are in lockstep. Divergence is still possible. The inactive threads (if any) would be somehow divergent from the active threads. Note that both of these statements are describing the CUDA SIMT model and they have been correct and true since day 1 of CUDA. They are not specific to the Volta execution model.
For the remainder of your question, I guess instead of this:
I would appreciate if someone can clarify this confusion?
You are asking:
Why is the syncwarp needed?
Two reasons:
As stated near the top of that post:
Thread synchronization: synchronize threads in a warp and provide a memory fence. __syncwarp
A memory fence is needed in this case, to prevent the compiler from "optimizing" shared memory locations into registers.
The CUDA programming model provides no specified order of thread execution. It would be a good idea for you to acknowledge that statement as ground truth. If you write code that requires a specific order of thread execution (for correctness), and you don't provide for it explicitly in your source code as a programmer, your code is broken. Regardless of the way it behaves or what results it produces.
The Volta whitepaper is describing the behavior of a specific hardware implementation of a CUDA-compliant device. The hardware may ensure things that are not guaranteed by the programming model.
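As an aside, a hedged sketch (my own, not from the blog post) of how the warp-level tail of a reduction can be written with __shfl_down_sync(); the partial sums stay in registers, so there is no shared-memory ordering to protect and no __syncwarp() is needed between steps:
// Assumes all 32 threads of the warp are active and participate.
__device__ int warp_sum(int v)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);  // add the value held by lane + offset
    return v;  // lane 0 now holds the sum of all 32 lanes' inputs
}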
It is my understanding (see e.g. How can I enforce CUDA global memory coherence without declaring pointer as volatile?, CUDA block synchronization differences between GTS 250 and Fermi devices, and this post in the NVIDIA Developer Zone) that __threadfence() guarantees that global writes will be visible to other threads before the thread continues. However, another thread could still read a stale value from its L1 cache even after the __threadfence() has returned.
That is:
Thread A writes some data to global memory, then calls __threadfence(). Then, at some time after __threadfence() has returned, and the writes are visible to all other threads, Thread B is asked to read from this memory location. It finds it has the data in L1, so loads that. Unfortunately for the developer, the data in Thread B's L1 is stale (i.e. it is as before Thread A updated this data).
First of all: is this correct?
Supposing it is, then it seems to me that __threadfence() is only useful if either one can be certain that data will not be in L1 (somewhat unlikely?) or if e.g. the read always bypasses L1 (e.g. volatile or atomics). Is this correct?
I ask because I have a relatively simple use-case - propagating data up a binary tree - using atomically-set flags and __threadfence(): the first thread to reach a node exits, and the second writes data to it based on its two children (e.g. the minimum of their data). This works for most nodes, but usually fails for at least one. Declaring the data volatile gives consistently correct results, but induces a performance hit for the 99%+ of cases where no stale value is grabbed from L1. I want to be sure this is the only solution for this algorithm. A simplified example is given below. Note that the node array is ordered breadth-first, with the leaves beginning at index start and already populated with data.
__global__ void propagate_data(volatile Node *nodes,
                               const unsigned int n_nodes,
                               const unsigned int start,
                               unsigned int* flags)
{
    int tid, index, left, right;
    float data;
    bool first_arrival;

    tid = start + threadIdx.x + blockIdx.x*blockDim.x;
    while (tid < n_nodes)
    {
        // We start at a node with a full data section; modify its flag
        // accordingly.
        flags[tid] = 2;
        // Immediately move up the tree.
        index = nodes[tid].parent;
        first_arrival = (atomicAdd(&flags[index], 1) == 0);
        // If we are the second thread to reach this node then process it.
        while (!first_arrival)
        {
            left = nodes[index].left;
            right = nodes[index].right;
            // If Node* nodes is not declared volatile, this occasionally
            // reads a stale value from L1.
            data = min(nodes[left].data, nodes[right].data);
            nodes[index].data = data;
            if (index == 0) {
                // Root node processed, so all nodes processed.
                return;
            }
            // Ensure above global write is visible to all device threads
            // before setting flag for the parent.
            __threadfence();
            index = nodes[index].parent;
            first_arrival = (atomicAdd(&flags[index], 1) == 0);
        }
        tid += blockDim.x*gridDim.x;
    }
    return;
}
First of all: is this correct?
Yes, __threadfence() pushes data into L2 and out to global memory. It has no effect on the L1 caches in other SMs.
Is this correct?
Yes, if you combine __threadfence() with volatile for global memory accesses, you should have confidence that values will eventually be visible to other threadblocks. Note, however that synchronization between threadblocks is not a well-defined concept in CUDA. There are no explicit mechanisms to do so and no guarantee of the order of threadblock execution, so just because you have code that has a __threadfence() somewhere operating on a volatile item, still does not really guarantee what data another threadblock may pick up. That is also dependent on the order of execution.
If you use volatile, the L1 (if enabled -- current Kepler devices don't really have L1 enabled for general global access) should be bypassed. If you don't use volatile, then the L1 for the SM that is currently executing the __threadfence() operation should be consistent/coherent with L2 (and global) at the completion of the __threadfence() operation.
Note that the L2 cache is unified across the device and is therefore always "coherent". For your use case, at least from the device code perspective, there is no difference between L2 and global memory, regardless of which SM you are on.
And, as you indicate, (global) atomics always operate on L2/global memory.
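Building on that, a hedged sketch (my own illustration, not the answer's code) of one way to limit the volatile qualification to the loads that can actually observe stale L1 data, rather than declaring the whole Node array volatile; it assumes Node has a float data member as in the question:
// Force just this one read to bypass L1 by going through a volatile pointer.
__device__ float load_child_data(const Node *nodes, int child)
{
    return *(const volatile float *)&nodes[child].data;
}

// Inside the inner loop of propagate_data, with nodes not declared volatile:
//     data = min(load_child_data(nodes, left), load_child_data(nodes, right));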
I have a piece of serial code which does something like this
if( ! variable )
{
    // do some initialization here
    variable = true;
}
I understand that this works perfectly fine in serial code and will only be executed once. What atomic operation would be the correct one here in CUDA?
It looks to me like what you want is a "critical section" in your code. A critical section allows one thread to execute a sequence of instructions while preventing any other thread or threadblock from executing those instructions.
A critical section can be used to control access to a memory area, for example, so as to allow un-conflicted access to that area by a single thread.
Atomics by themselves can only be used for a very limited, basically single operation, on a single variable. But atomics can be used to build a critical section.
You should use the following code in your kernel to control thread access to a critical section:
__syncthreads();
if (threadIdx.x == 0)
    acquire_semaphore(&sem);
__syncthreads();
// begin critical section
// ... your critical section code goes here
// end critical section
__threadfence(); // not strictly necessary for the lock, but to make any global updates in the critical section visible to other threads in the grid
__syncthreads();
if (threadIdx.x == 0)
    release_semaphore(&sem);
__syncthreads();
Prior to the kernel, define these helper functions and this device variable:
__device__ volatile int sem = 0;

__device__ void acquire_semaphore(volatile int *lock){
    while (atomicCAS((int *)lock, 0, 1) != 0);
}

__device__ void release_semaphore(volatile int *lock){
    *lock = 0;
    __threadfence();
}
I have tested and used successfully the above code. Note that it essentially arbitrates between threadblocks using thread 0 in each threadblock as a requestor. You should further condition (e.g. if (threadIdx.x < ...)) your critical section code if you want only one thread in the winning threadblock to execute the critical section code.
Having multiple threads within a warp arbitrate for a semaphore presents additional complexities, so I don't recommend that approach. Instead, have each threadblock arbitrate as I have shown here, and then control your behavior within the winning threadblock using ordinary threadblock communication/synchronization methods (e.g. __syncthreads(), shared memory, etc.)
Note that this methodology will be costly to performance. You should only use critical sections when you cannot figure out how to otherwise parallelize your algorithm.
Finally, a word of warning. As in any threaded parallel architecture, improper use of critical sections can lead to deadlock. In particular, making assumptions about order of execution of threadblocks and/or warps within a threadblock is a flawed approach.
Here is an example of usage of binary_semaphore to implement a single device global "lock" that could be used for access control to a critical section.
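A hedged sketch of what such usage might look like, assuming the cuda::binary_semaphore type from libcu++ (CUDA 11+) and that your toolkit accepts a statically initialized __device__-scope semaphore; the names below are my own, not the example the answer refers to:
#include <cuda/semaphore>

// Device-wide binary semaphore, statically initialized to 1 (unlocked).
__device__ cuda::binary_semaphore<cuda::thread_scope_device> bsem{1};

__global__ void kernel()
{
    __syncthreads();
    if (threadIdx.x == 0)
        bsem.acquire();          // one requestor thread per block, as above
    __syncthreads();
    // ... block-wide critical section ...
    __syncthreads();
    if (threadIdx.x == 0)
        bsem.release();
    __syncthreads();
}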
If I have a kernel which looks back over the last X minutes and calculates the average of all the values in a float[], would I experience a performance drop if all the threads are not executing the same line of code at the same time?
eg:
Say at x = 1500 there are 500 data points spanning the last 2-hour period, while at x = 1510 there are 300 data points spanning the last 2-hour period.
The thread at x = 1500 will have to look back 500 places, yet the thread at x = 1510 only looks back 300, so the latter thread will move on to the next position before the first thread is finished.
Is this typically an issue?
EDIT: Example code. Sorry, but it's in C# as I was planning to use CUDAfy.net. Hopefully it gives a rough idea of the type of programming structures I need to run (the actual code is more complicated but has a similar structure). Any comments regarding whether this is suitable for a GPU/coprocessor or just a CPU would be appreciated.
public void PopulateMeanArray(float[] data)
{
    float lookFwdDistance = 108000000000f;
    float lookBkDistance = 12000000000f;
    int counter = thread.blockIdx.x * 1000; // Ensures unique position in data is written to (assuming i have less than 1000 entries).
    float numberOfTicksInLookBack = 0;
    float sum = 0; // Stores the sum of difference between two time ticks during x min look back.

    // Note: Time difference between each time tick is not consistent, therefore different value of numberOfTicksInLookBack at each position.
    // Thread 1 could be working here.
    for (float tickPosition = SDS.tick[thread.blockIdx.x]; SDS.tick[tickPosition] < SDS.tick[(tickPosition + lookFwdDistance)]; tickPosition++)
    {
        sum = 0;
        numberOfTicksInLookBack = 0;
        // Thread 2 could be working here. Is this warp divergence?
        for (float pastPosition = tickPosition - 1; SDS.tick[pastPosition] > (SDS.tick[tickPosition - lookBkDistance]); pastPosition--)
        {
            sum += SDS.tick[pastPosition] - SDS.tick[pastPosition + 1];
            numberOfTicksInLookBack++;
        }
        data[counter] = sum / numberOfTicksInLookBack;
        counter++;
    }
}
CUDA runs threads in groups called warps. On all CUDA architectures that have been implemented so far (up to compute capability 3.5), the size of a warp is 32 threads. Only threads in different warps can truly be at different locations in the code. Within a warp, threads are always in the same location. Any threads that should not be executing the code in a given location are disabled as that code is executed. The disabled threads are then just taking up room in the warp and cause their corresponding processing cycles to be lost.
In your algorithm, you get warp divergence because the exit condition in the inner loop is not satisfied at the same time for all the threads in the warp. The GPU must keep executing the inner loop until the exit condition is satisfied for ALL the threads in the warp. As more threads in a warp reach their exit condition, they are disabled by the machine and represent lost processing cycles.
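To make that concrete, a minimal CUDA sketch (my own, with a hypothetical trip_count array) of the situation described: threads in one warp run a loop a different number of times, so the warp keeps issuing the loop body until its longest-running lane has finished.
__global__ void divergent_inner_loop(const int *trip_count, float *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    // Per-thread exit condition, analogous to the variable look-back above.
    for (int i = 0; i < trip_count[tid]; ++i)
        sum += 1.0f;
    // Lanes that exit early are masked off while their warp mates keep
    // iterating; their cycles are the lost processing cycles described above.
    out[tid] = sum;
}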
In some situations, the lost processing cycles may not impact performance, because disabled threads do not issue memory requests. This is the case if your algorithm is memory bound and the memory that would have been required by the disabled thread was not included in the read done by one of the other threads in the warp. In your case, though, the data is arranged in such a way that accesses are coalesced (which is a good thing), so you do end up losing performance in the disabled threads.
Your algorithm is very simple and, as it stands, the algorithm does not fit that well on the GPU. However, I think the same calculation can be dramatically sped up on both the CPU and GPU with a different algorithm that uses an approach more like that used in parallel reductions. I have not considered how that might be done in a concrete way though.
A simple thing to try, for a potentially dramatic increase in speed on the CPU, would be to alter your algorithm in such a way that the inner loop iterates forwards instead of backwards. This is because CPUs do cache prefetches. These only work when you iterate forwards through your data.