Correct implementation of spin lock in CUDA

A lot of sources provide implementation of spin lock in CUDA:
https://devtalk.nvidia.com/default/topic/1014009/try-to-use-lock-and-unlock-in-cuda/
Cuda Mutex, why deadlock?
How to implement Critical Section in cuda?
Implementing a critical section in CUDA
https://wlandau.github.io/gpu/lectures/cudac-atomics/cudac-atomics.pdf.
They follow the same pattern:
1. LOCK: wait for the atomic change of the value of lock from 0 to 1
2. do some critical operations
3. UNLOCK: release the lock by setting its value to 0
Let's assume that we don't have warp-divergence or, in other words, we don't use locks for interwarp synchronization.
What is the right way to implement step 1?
Some answers propose using atomicCAS while others propose atomicExch. Are both equivalent?
while (0 != (atomicCAS(&lock, 0, 1))) {}
while (atomicExch(&lock, 1) != 0) {}
What is the right way to implement step 3?
Almost all sources propose to use atomicExch for that:
atomicExch(&lock, 0);
One user proposed an alternative (Implementing a critical section in CUDA) that also makes sense, but it didn't work for him (so it probably leads to undefined behavior in CUDA):
lock = 0;
It seems that for a general spin lock on a CPU it is valid to do that: https://stackoverflow.com/a/7007893/8044236. Why can't we use it in CUDA?
Do we have to use memory fence and volatile specifier for memory accesses in step 2?
CUDA docs about atomics (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions) say that they don't guarantee ordering constraints:
Atomic functions do not act as memory fences and do not imply synchronization or ordering constraints for memory operations
Does it mean that we have to use a memory fence at the end of the critical section (2) to ensure that changes made inside the critical section (2) are visible to other threads before unlocking (3)?
Does CUDA guarantee that other threads will ever see the changes made by a thread with atomic operations in steps (1) and (3)?
This is not true for memory fences (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-fence-functions):
Memory fence functions only affect the ordering of memory operations by a thread; they do not ensure that these memory operations are visible to other threads (like __syncthreads() does for threads within a block (see Synchronization Functions)).
So perhaps it is also not true for atomic operations? If so, all spinlock implementations in CUDA rely on undefined behavior.
How can we implement a reliable spinlock in the presence of warps?
Now, provided that we have answers to all the questions above, let's remove the assumption that we don't have warp divergence. Is it possible to implement spinlock in such a case?
The main issue (deadlock) is illustrated on slide 30 of https://wlandau.github.io/gpu/lectures/cudac-atomics/cudac-atomics.pdf.
Is the only option to replace while loop by if in step (1) and enclose all 3 steps in single while loop as proposed, for example, in Thread/warp local lock in cuda or CUDA, mutex and atomicCAS()?

What is the right way to implement step 1? Some answers propose using
atomicCAS while others propose atomicExch. Are both equivalent?
No, they are not, and only the atomicCAS version is correct. The point of that code is to check that the change of the lock's state from unlocked to locked by a given thread actually worked. The atomicExch version doesn't do that, because it doesn't check that the initial state was unlocked before performing the assignment.
It seems that for a general spin lock on a CPU it is valid to do that: https://stackoverflow.com/a/7007893/8044236. Why can't we use it in CUDA?
If you read the comments on that answer you will see that it isn't valid on a CPU either.
Do we have to use memory fence and volatile specifier for memory accesses in step 2?
That depends completely on your end use and why you are using a critical section in the first place. Obviously, if you want a given thread's manipulation of globally visible memory to be globally visible, you need either a fence or an atomic transaction to do so, and you need to ensure that values are not cached in registers by the compiler.
Does CUDA guarantee that other threads will ever see the changes made by a thread with atomic operations in steps (1) and (3)?
Yes, but only for other operations performed atomically. Atomic operations imply serialization of all memory transactions performed on a given address, and they return the previous state of an address when a thread performs an action.
How can we implement a reliable spinlock in the presence of warps?
Warp divergence is irrelevant. The serialization of atomic memory operations itself implies warp divergence when multiple threads from the same warp try to obtain the lock.
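Putting these points together, the if-inside-while pattern mentioned in the question is one way to avoid intra-warp deadlock. A hedged sketch (the lock word, the fence placement, and the trivial critical section are illustrative assumptions, not a vetted implementation):

```cuda
// Sketch only: a spinlock guarding a critical section.
// `lock` is a global int initialized to 0 (unlocked).
__device__ void lockedIncrement(int *lock, int *sharedCounter)
{
    bool done = false;
    while (!done) {                          // retry loop around all 3 steps
        if (atomicCAS(lock, 0, 1) == 0) {    // step 1: try to acquire ONCE
            *sharedCounter += 1;             // step 2: critical operations
            __threadfence();                 // make the update visible before release
            atomicExch(lock, 0);             // step 3: release
            done = true;
        }
    }
}
```

Because each thread attempts the lock with `if` rather than a nested `while`, threads of the same warp that fail to acquire simply retry on the next outer iteration instead of spinning while the lock holder is masked off.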

Related

Can many threads set bit on the same word simultaneously?

I need each thread of a warp to decide whether to set its respective bit in a 32-bit word. Does this multiple setting take only one memory access, or will there be one memory access for each bit set?
There is no independent bit-setting capability in CUDA. (There is a bit-field-insert instruction in PTX, but it nevertheless operates on a 32-bit quantity.)
Each thread would set a bit by doing a full 32-bit write. Such a write would need to be an atomic RMW operation in order to preserve the other bits. Therefore the accesses would effectively be serialized, at whatever the throughput of atomics is.
If memory space is not a concern, breaking the bits out into separate integers would allow you to avoid atomics.
A 32-bit packed quantity could then be quickly assembled using the __ballot() warp vote function. An example is given in the answer here.
(In fact, the warp vote function may allow you to avoid memory transactions altogether; everything can be handled in registers, if the only result you need is the 32-bit packed quantity.)
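For instance, a warp-wide bit pack with the modern, sync-qualified vote intrinsic might look like this sketch (the predicate `myValue > 0` is an assumed placeholder):

```cuda
// Sketch: each lane contributes one bit of a 32-bit word, entirely in registers.
__device__ unsigned int packWarpBits(int myValue)
{
    // Lane i sets bit i of the result iff its predicate is true.
    // __ballot_sync returns the same mask to every participating lane,
    // so no memory transaction is needed at all.
    return __ballot_sync(0xFFFFFFFFu, myValue > 0);
}
```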

Is there a way to avoid redundant computations by warps in a thread block on partially overlapped arrays

I want to design a CUDA kernel that has thread blocks where warps read their own 1-D arrays. Suppose that a thread block with two warps takes two arrays {1,2,3,4} and {2,4,6,8}. Each warp then performs some computations by reading its own array. The computation is done on a per-element basis. This means that the thread block would perform redundant computations for the elements 2 and 4, which appear in both arrays.
Here is my question: How I can avoid such redundant computations?
Precisely, I want to make a warp skip the computation of an element once the element has already been touched by another warp; otherwise the computation proceeds normally, because no warp has touched the element before.
Using a hash table in the shared memory dedicated to a thread block may be considered. But I worry about performance degradation due to hash table accesses whenever a warp accesses elements of an array.
Any idea or comments?
In parallel computation on many-core co-processors, it is desirable to perform arithmetic operations on an independent set of data, i.e. to eliminate any sort of dependency within the set of vectors which you provide to threads/warps. In this way, the computations can run in parallel. If you want to keep track of elements that have been previously computed (in this case, 2 and 4, which are common to the two input arrays), you have to serialize and create branches, which in turn diminishes computing performance.
In conclusion, you need to check if it is possible to eliminate redundancy at the input level by reducing the input vectors to those with different components. If not, skipping redundant computations of repeated components may not necessarily improve the performance, since the computations are performed in batch.
Let us try to understand what happens on the hardware level.
First of all, computation in CUDA happens via warps. A warp is a group of 32 threads which are synchronized with each other. An instruction is executed on all the threads of a corresponding warp at the same time instance. So technically it's not threads but warps that ultimately execute on the hardware.
Now, let us suppose that somehow you are able to keep track of which elements do not need computation, so you put a condition in the kernel like
...
if (computationNeeded) {
    compute();
} else {
    ...
}
...
Now let us assume that there are 5 threads in a particular warp for which computationNeeded is false, so they don't need computation. But according to our definition, all threads execute the same instruction. So even if you put these conditions, all threads have to step through both the if and else blocks.
Exception:
It will be faster if all the threads of a particular warp take either the if or the else branch. But that's an exceptional case for almost all real-world algorithms.
Suggestion
Put a pre-processing step, either on the CPU or the GPU, which eliminates the redundant elements from the input data. Also, check whether this step is worth the added cost.
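As an illustration of such a pre-processing step, one could deduplicate the input on the GPU with Thrust before launching the main kernel (a sketch, assuming element order does not matter; the function name is hypothetical):

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/unique.h>

// Sketch: sort the input and drop duplicate elements so each value
// is computed exactly once by the main kernel.
thrust::device_vector<int> deduplicate(thrust::device_vector<int> input)
{
    thrust::sort(input.begin(), input.end());
    auto newEnd = thrust::unique(input.begin(), input.end());
    input.erase(newEnd, input.end());
    return input;
}
```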
Precisely, I want to make a warp skip the computation of an element once the element has been already touched by other warps, otherwise the computation goes normally because any warps never touched the element before.
Using a hash table in the shared memory dedicated to a thread block may be considered
If you want inter-warp communication between ALL the warps of the grid, then it is only possible via global memory, not shared memory.

Does __syncthreads() synchronize all threads in the grid?

...or just the threads in the current warp or block?
Also, when the threads in a particular block encounter (in the kernel) the following line
__shared__ float srdMem[128];
will they just declare this space once (per block)?
They all obviously operate asynchronously, so if Thread 23 in Block 22 is the first thread to reach this line, and Thread 69 in Block 22 is the last one to reach it, will Thread 69 know that it has already been declared?
The __syncthreads() command is a block-level synchronization barrier. That means it is safe to use when all threads in a block reach the barrier. It is also possible to use __syncthreads() in conditional code, but only when all threads evaluate such code identically; otherwise the execution is likely to hang or produce unintended side effects.
Example of using __syncthreads(): (source)
__global__ void globFunction(int *arr, int N)
{
    __shared__ int local_array[THREADS_PER_BLOCK]; // local block memory cache
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    // ...calculate results
    local_array[threadIdx.x] = results;
    // synchronize the local threads writing to the local memory cache
    __syncthreads();
    // read the results of another thread in the current thread
    int val = local_array[(threadIdx.x + 1) % THREADS_PER_BLOCK];
    // write back the value to global memory
    arr[idx] = val;
}
To synchronize all threads in a grid, there is currently no native API call. One way of synchronizing threads at grid level is to use consecutive kernel calls, since at that point all threads end and start again from the same point. This is also commonly called CPU synchronization or implicit synchronization.
Example of using this technique (source):
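A minimal sketch of the consecutive-kernel-launch technique (the kernel names and launch configuration are illustrative, not the linked source's exact code):

```cuda
__global__ void phase1(int *data, int n) { /* ... every thread writes its part ... */ }
__global__ void phase2(int *data, int n) { /* ... safely reads phase1's results ... */ }

void runPhases(int *d_data, int n)
{
    int blocks = (n + 255) / 256;
    phase1<<<blocks, 256>>>(d_data, n);
    // Kernels launched on the same stream execute in order, so every
    // thread of phase1 finishes before any thread of phase2 starts.
    phase2<<<blocks, 256>>>(d_data, n);
}
```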
Regarding the second question. Yes, it does declare the amount of shared memory specified per block. Take into account that the quantity of available shared memory is measured per SM. So one should be very careful how the shared memory is used along with the launch configuration.
I agree with all the answers here, but I think we are missing one important point w.r.t. the first question. I am not answering the second question, as it was answered perfectly above.
Execution on a GPU happens in units of warps. A warp is a group of 32 threads, and at one time instance each thread of a particular warp executes the same instruction. If you allocate 128 threads in a block, that's (128/32 =) 4 warps for the GPU.
Now the question becomes: "If all threads are executing the same instruction, then why is synchronization needed?" The answer is that we need to synchronize the warps that belong to the SAME block. __syncthreads() does not synchronize the threads within a warp; they are already synchronized. It synchronizes warps that belong to the same block.
That is why the answer to your question is: __syncthreads() does not synchronize all threads in a grid, but only the threads belonging to one block, as each block executes independently.
If you want to synchronize a grid, then divide your kernel (K) into two kernels (K1 and K2) and call both. They will be synchronized (K2 will be executed after K1 finishes).
__syncthreads() waits until all threads within the same block have reached the command; that means all warps that belong to a thread block must reach the statement.
If you declare shared memory in a kernel, the array will only be visible to one thread block. So each block will have its own shared memory block.
Existing answers have done a great job answering how __syncthreads() works (it allows intra-block synchronization); I just wanted to add an update that there are now newer methods for inter-block synchronization. Since CUDA 9.0, "Cooperative Groups" have been introduced, which allow synchronizing an entire grid of blocks (as explained in the CUDA Programming Guide). This achieves the same functionality as launching a new kernel (as mentioned above), but can usually do so with lower overhead and make your code more readable.
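A sketch of grid-wide synchronization with Cooperative Groups (note this requires a cooperative launch via cudaLaunchCooperativeKernel and device support; the kernel body is an illustrative assumption):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void twoPhaseKernel(int *data)
{
    cg::grid_group grid = cg::this_grid();
    // Phase 1: every thread writes its part of `data` ...
    grid.sync();   // barrier across ALL blocks of the grid
    // Phase 2: every thread may now read what any other thread wrote ...
}
```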
In order to provide further details, aside of the answers, quoting seibert:
More generally, __syncthreads() is a barrier primitive designed to protect you from read-after-write memory race conditions within a block.
The rules of use are pretty simple:
Put a __syncthreads() after the write and before the read when there is a possibility of a thread reading a memory location that another thread has written to.
__syncthreads() is only a barrier within a block, so it cannot protect you from read-after-write race conditions in global memory unless the only possible conflict is between threads in the same block. __syncthreads() is pretty much always used to protect shared memory read-after-write.
Do not use a __syncthreads() call in a branch or a loop until you are sure every single thread will reach the same __syncthreads() call. This can sometimes require that you break your if-blocks into several pieces to put __syncthreads() calls at the top level where all threads (including those which failed the if predicate) will execute them.
When looking for read-after-write situations in loops, it helps to unroll the loop in your head when figuring out where to put __syncthreads() calls. For example, you often need an extra __syncthreads() call at the end of the loop if there are reads and writes from different threads to the same shared memory location in the loop.
__syncthreads() does not mark a critical section, so don’t use it like that.
Do not put a __syncthreads() at the end of a kernel call. There’s no need for it.
Many kernels do not need __syncthreads() at all because two different threads never access the same memory location.
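The rule above about breaking up if-blocks so the barrier sits at top level can be sketched like this (the write/read phases and sizes are hypothetical):

```cuda
__global__ void splitBranchExample(float *out)
{
    __shared__ float tile[128];
    bool active = threadIdx.x < 128;   // only half the threads do the work

    if (active)
        tile[threadIdx.x] = threadIdx.x * 2.0f;     // write phase

    __syncthreads();   // top level: EVERY thread, active or not, reaches it

    if (active)
        out[threadIdx.x] = tile[127 - threadIdx.x]; // read phase
}
```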

Which is better: atomic contention between threads of a single warp, or between threads of different warps?

Which is better: atomic contention (concurrency) between threads of a single warp, or between threads of different warps in one block? I think that when accessing shared memory it is better when the threads of one warp compete with each other less than threads of different warps do. And with access to global memory, on the contrary, it is better that threads of different warps of one block compete less than the threads of a single warp, isn't it?
I need to know how best to resolve contention (concurrency) and how best to separate storage: between threads in a single warp or between warps.
Incidentally, does the __syncthreads() command synchronize the warps in a single block rather than the threads of one warp?
If a significant number of threads in a block perform atomic updates to the same value, you will get poor performance since those threads must all be serialized. In such cases, it is usually better to have each thread write its result to a separate location and then, in a separate kernel, process those values.
If each thread in a warp performs an atomic update to the same value, all the threads in the warp perform the update in the same clock cycle, so they must all be serialized at the point of the atomic update. This probably means that the warp is scheduled 32 times to get all the threads serviced (very bad).
On the other hand, if a single thread in each warp in a block performs an atomic update to the same value, the impact will be lower because the pairs of warps (the two warps processed at each clock by the two warp schedulers) are offset in time (by one clock cycle) as they move through the processing pipelines. So you end up with only two atomic updates (one from each of the two warps) getting issued within one cycle and needing to be immediately serialized.
So, in the second case, the situation is better, but still problematic. The reason is that, depending on where the shared value is, you can still get serialization between SMs, and this can be very slow since each thread may have to wait for updates to go all the way out to global memory, or at least to L2, and then back. It may be possible to refactor the algorithm in such a way that threads within a block perform atomic updates to a value in shared memory (L1), and then have one thread in each block perform an atomic update to a value in global memory (L2).
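The refactoring described above can be sketched as a two-level counter (the names and the counting task are illustrative assumptions):

```cuda
// Sketch: block-local atomics in shared memory, then one global atomic per block.
__global__ void countPositives(const int *in, int n, int *globalCount)
{
    __shared__ int blockCount;
    if (threadIdx.x == 0) blockCount = 0;
    __syncthreads();

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n && in[idx] > 0)
        atomicAdd(&blockCount, 1);           // contention stays within the block

    __syncthreads();
    if (threadIdx.x == 0)
        atomicAdd(globalCount, blockCount);  // one global update per block
}
```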
The atomic operations can be complete lifesavers but they tend to be overused by people new to CUDA. It is often better to use a separate step with a parallel reduction or parallel stream compaction algorithm (see thrust::copy_if).

How to have atomic load in CUDA

My question is how I can have an atomic load in CUDA. Atomic exchange can emulate an atomic store. Can an atomic load be emulated inexpensively in a similar manner?
I can use an atomic add with 0 to load the contents atomically, but I think it is expensive because it does an atomic read-modify-write instead of only a read.
In addition to using volatile as recommended in the other answer, using __threadfence appropriately is also required to get an atomic load with safe memory ordering.
While some of the comments are saying to just use a normal read because it cannot tear, that is not the same as an atomic load. There's more to atomics than just tearing:
A normal read may reuse a previous load that's already in a register, and thus may not reflect changes made by other SMs with the desired memory ordering. For instance, int *flag = ...; while (*flag) { ... } may only read flag once and reuse this value for every iteration of the loop. If you're waiting for another thread to change the flag's value, you'll never observe the change. The volatile modifier ensures that the value is actually read from memory on every access. See the CUDA documentation on volatile for more info.
Additionally, you'll need to use a memory fence to enforce the correct memory ordering in the calling thread. Without a fence, you get "relaxed" semantics in C++11 parlance, and this can be unsafe when using an atomic for communication.
For example, say your code (non-atomically) writes some large data to memory and then uses a normal write to set an atomic flag to indicate that the data has been written. The instructions may be reordered, hardware cachelines may not be flushed prior to setting the flag, etc. The result is that these operations are not guaranteed to execute in any particular order, and other threads may not observe these events in the order you expect: the write to the flag is permitted to happen before the guarded data is written.
Meanwhile, if the reading thread is also using normal reads to check the flag before conditionally loading the data, there will be a race at the hardware level. Out-of-order and/or speculative execution may load the data before the flag's read is completed. The speculatively loaded data is then used, which may not be valid since it was loaded prior to the flag's read.
Well-placed memory fences prevent these sorts of issues by enforcing that instruction reordering will not affect your desired memory ordering and that previous writes are made visible to other threads. __threadfence() and friends are also covered in the CUDA docs.
Putting all of this together, writing your own atomic load method in CUDA looks something like:
// addr must be aligned properly.
__device__ unsigned int atomicLoad(const unsigned int *addr)
{
    const volatile unsigned int *vaddr = addr; // volatile to bypass cache
    __threadfence(); // for seq_cst loads. Remove for acquire semantics.
    const unsigned int value = *vaddr;
    // fence to ensure that dependent reads are correctly ordered
    __threadfence();
    return value;
}

// addr must be aligned properly.
__device__ void atomicStore(unsigned int *addr, unsigned int value)
{
    volatile unsigned int *vaddr = addr; // volatile to bypass cache
    // fence to ensure that previous non-atomic stores are visible to other threads
    __threadfence();
    *vaddr = value;
}
This can be written similarly for other non-tearing load/store sizes.
From talking with some NVIDIA devs who work on CUDA atomics, it looks like we should start seeing better support for atomics in CUDA, and the PTX already contains load/store instructions with acquire/release memory ordering semantics -- but there is no way to access them currently without resorting to inline PTX. They're hoping to add them sometime this year. Once those are in place, a full std::atomic implementation shouldn't be far behind.
To the best of my knowledge, there is currently no way of requesting an atomic load in CUDA, and that would be a great feature to have.
There are two quasi-alternatives, with their advantages and drawbacks:
Use a no-op atomic read-modify-write as you suggest. I have provided a similar answer in the past. Guaranteed atomicity and memory consistency but you pay the cost of a needless write.
In practice, the second closest thing to an atomic load could be marking a variable volatile, although strictly speaking the semantics are completely different. The language does not guarantee atomicity of the load (for example, you may in theory get a torn read), but you are guaranteed to get the most up-to-date value. In practice, though, as indicated in the comments by @Robert Crovella, it is impossible to get a torn read for properly-aligned transactions of at most 32 bytes, which does make them atomic.
Solution 2 is kind of hacky and I do not recommend it, but it is currently the only write-less alternative to solution 1. The ideal solution would be to add a way to express atomic loads directly in the language.