How is spin lock implemented under the hood? - language-agnostic

This is a lock that can be held by
only one thread of execution at a
time. An attempt to acquire the lock
by another thread of execution makes
the latter loop until the lock is
released.
How does it handle the case when two threads try to acquire the lock at exactly the same time?
I think this question also applies to various other mutex implementations.

As the previous poster indicates, every modern machine type has a special class of instructions known as 'atomics' that serialize execution against at least the specified memory location.
On x86, there is a LOCK assembler prefix that indicates to the machine that the next instruction should be handled atomically. When such an instruction is encountered, several things effectively happen on x86.
Pending read prefetches are canceled (this means that the CPU won't present data to the program that may be made stale across the atomic).
Pending writes to memory are flushed.
The operation is performed, guaranteed atomically and serialized against other CPUs. In this context, 'serialized' means 'they happen one-at-a-time'. Atomically means "all the parts of this instruction happen without anything else intervening".
For x86, two instructions are commonly used to implement locks.
CMPXCHG (compare and exchange). Pseudocode:
uint32 cmpxchg(uint32 *memory_location, uint32 old_value, uint32 new_value) {
    atomically {
        uint32 previous = *memory_location;
        if (previous == old_value)
            *memory_location = new_value;
        return previous;   // callers compare this against old_value to see whether the swap happened
    }
}
XCHG (exchange). Pseudocode:
uint32 xchg(uint32 *memory_location, uint32 new_value) {
    atomically {
        uint32 old_value = *memory_location;
        *memory_location = new_value;
        return old_value;
    }
}
So, you can implement a lock like this:
uint32 mylock = 0;
while (cmpxchg(&mylock, 0, 1) != 0)
;
We spin, waiting for the lock, hence, spinlock.
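For comparison, here is a minimal sketch of the same idea in portable C++ using std::atomic (the Spinlock type and its members are illustrative, not part of any particular library; a production lock would also back off or yield while spinning):

#include <atomic>
#include <cstdint>

struct Spinlock {
    std::atomic<uint32_t> state{0};   // 0 = free, 1 = held

    void lock() {
        uint32_t expected = 0;
        // The portable analogue of LOCK CMPXCHG: atomically set state to 1
        // only if it is currently 0, looping until that succeeds.
        while (!state.compare_exchange_weak(expected, 1,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed)) {
            expected = 0;  // compare_exchange overwrites 'expected' on failure
        }
    }

    void unlock() {
        // Releasing store pairs with the acquire above, so writes made inside
        // the critical section are visible to the next lock owner.
        state.store(0, std::memory_order_release);
    }
};

On x86, the compare-exchange loop typically compiles down to exactly the LOCK CMPXCHG spin described above.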
Now, instructions without the LOCK prefix don't exhibit these nice behaviors. Depending on what machine you're on, all sorts of violations of consistency can be observed with unlocked instructions. For example, even on x86, which has a very friendly memory consistency model, the following could be observed:
Thread 1              Thread 2
mov [w], 0            mov [x], 0
mov [w], 1            mov [x], 2
mov eax, [x]          mov eax, [w]
mov [y], eax          mov [z], eax
At the end of this program, y and z can both have the value 0! Each thread's load can execute before the other thread's final store has become visible to it, so both threads can still observe the initial 0 writes.
Anyway, one last note: LOCK on x86 can be applied to ADD, OR, and AND, in order to get consistent and atomic read-modify-write semantics for the instruction. This is important for, say, setting flag variables and making sure they don't get lost. Without that, you have this problem:
Thread 1              Thread 2
OR [x], 0x1           OR [x], 0x2
At the end of this program (with x starting at 0), the possible values for x are 1, 2, and 0x1|0x2 (3), because the two unlocked read-modify-writes can interleave and one thread's update can overwrite the other's. In order to get a correct program, you need:
Thread 1              Thread 2
LOCK OR [x], 0x1      LOCK OR [x], 0x2
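In C++ terms, the same lost-update-free behaviour can be expressed with an atomic fetch_or, which a compiler typically lowers to a LOCK-prefixed instruction (or a LOCK CMPXCHG loop) on x86; a small illustrative sketch:

#include <atomic>
#include <cstdint>

std::atomic<uint32_t> x{0};

void thread1() { x.fetch_or(0x1); }   // atomic read-modify-write
void thread2() { x.fetch_or(0x2); }   // atomic read-modify-write
// After both threads have run, x is guaranteed to be 0x3; neither flag can be lost.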
Hope this helps.

It depends on the processor and the threading implementation. Most processors have instructions that can be executed atomically, on top of which you can build things like spin locks. For example, IA-32 has an xchg instruction that does an atomic swap. You can then implement a naive spinlock like:
eax = 1;
while( xchg(eax, lock_address) != 0 );
// now I have the lock
... code ...
*lock_address = 0; // release the lock

Related

In ARM7TDMI, when FIQ and IRQ occur at the same time, how are both executed sequentially (first FIQ, then IRQ)?

In ARM7TDMI, suppose an instruction is being executed and FIQ and IRQ both occur at the same time. According to priority, the FIQ will be handled first and then the IRQ, but my question is how the IRQ gets handled after the return from the FIQ.
That is, what happens on return from the FIQ handler, and how is control transferred to the IRQ handler after the FIQ handler's return instruction?
example :
address => instruction
0x00000100 : MOV R0,R1
0x00000104 : MOV R0,R1
=>> 0x00000108 : MOV R0,R1
0x00000110 : MOV R0,R1
0x00000114 : MOV R0,R1
0x00000118 : MOV R0,R1
; suppose the instruction at 0x00000108 is being executed when FIQ and IRQ are both raised at the same time
Unlike the M-profile architectures, with their very different exception model which does permit tail-chaining exceptions, the classic/A-profile architectures do things in a completely straightforward manner.
Interrupts are checked for at instruction boundaries, when the respective CPSR.F/CPSR.I bit is clear. Thus, assuming the FIQ handler is straightforward, once the instruction at 0x108 completes, the FIQ is taken (as it has priority over the IRQ) from whatever mode the CPU was in, the FIQ handler runs with FIQs and IRQs masked, then performs an exception return to 0x110. The fact that there happened to be an IRQ pending throughout makes no difference whatsoever.
The point of note is the boundary between the return instruction at the end of the FIQ handler and the one being returned to. The FIQ return will restore the previous SPSR, which (presumably) has IRQs unmasked. Thus, after executing that return instruction but before executing the one at 0x110, the CPU is back in the initial mode, with IRQs unmasked, and an IRQ pending. So it takes it; the IRQ handler runs with IRQs masked, then performs an exception return to 0x110, whereupon execution eventually continues having served both interrupts.
For ARM7TDMI, that's really all there is to it. In newer architecture versions (ARMv7 onwards), there are some rules tightening up precisely when asynchronous exceptions are expected to be taken, since once CPU designs start becoming superscalar and/or out-of-order the notion of "instruction boundary" gets a bit blurry. This particular situation, though, would be no different on modern CPUs, as the exception return from FIQ constitutes a context-synchronising event after which any pending asynchronous exception (i.e. the IRQ) must be immediately taken.
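To make the sequence concrete, here is an illustrative-only C++ model of the decision made at each instruction boundary; the type and function names are invented for this sketch and are not real CPU interfaces:

#include <cstdint>

struct Cpu {
    bool     cpsr_f;   // FIQs masked while set
    bool     cpsr_i;   // IRQs masked while set
    uint32_t pc;
};

static void take_fiq(Cpu *cpu) {
    cpu->cpsr_f = true;   // FIQ entry masks both FIQ and IRQ
    cpu->cpsr_i = true;
    cpu->pc = 0x1C;       // FIQ vector
}

static void take_irq(Cpu *cpu) {
    cpu->cpsr_i = true;   // IRQ entry masks IRQs only
    cpu->pc = 0x18;       // IRQ vector
}

// Conceptually runs between instructions, including right after the FIQ
// handler's exception return has restored the SPSR (unmasking IRQs again),
// which is why the still-pending IRQ is taken before execution resumes at 0x110.
static void check_interrupts(Cpu *cpu, bool fiq_pending, bool irq_pending) {
    if (fiq_pending && !cpu->cpsr_f)
        take_fiq(cpu);    // FIQ wins when both are pending
    else if (irq_pending && !cpu->cpsr_i)
        take_irq(cpu);
}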

Which atomics targeting pattern is better for a GPU warp: all-the-same or all-distinct?

Suppose we have:
A single warp (of 32 threads)
Each thread t has 32 int values val_{t,0} ... val_{t,31},
Each value val_{t,i} needs to be added, atomically, to a variable dest_i, which resides in (Option 1) global device memory or (Option 2) shared block memory.
Which access pattern would be faster to perform these additions?
Pattern 1:
All threads atomic-add val_{t,1} to dest_1.
All threads atomic-add val_{t,2} to dest_2.
etc.
Pattern 2:
Each thread, with index t, atomic-adds val_{t,t} to dest_t.
Each thread, with index t, atomic-adds val_{t,(t+1) mod 32} to dest_{(t+1) mod 32}.
etc.
In other words, is it faster when all threads of a warp make an atomic write in the same cycle, or is it better that no atomic writes coincide? I can think of hardware which carries out either of the options faster; I want to know what's actually implemented.
Thoughts:
It's possible for a GPU to have hardware which bunches together multiple atomic ops from the same warp to a single destination, so that they actually count as just one, or at least can be scheduled together, and thus all threads will proceed to the next instruction at the same time, not waiting for the last atomic op to conclude after all the rest are done.
Notes:
This question focuses on NVidia hardware with CUDA but I'd appreciate answers regarding AMD and other GPUs.
Never mind how the threads get them. Assume that they're in registers and there's no spillage, or that they're the result of some arithmetic operation done in-registers. Forget about any memory accesses to get them.
This is how I understand your problem:
You have a 32x32 matrix of int values:
val_{0,0},  val_{1,0},  ..., val_{31,0}
val_{0,1},  val_{1,1},  ..., val_{31,1}
.
.
val_{0,31}, val_{1,31}, ..., val_{31,31}
And you need to sum each row:
val_{0,0} + val_{1,0} + ... + val_{31,0} = dest_0
val_{0,1} + val_{1,1} + ... + val_{31,1} = dest_1 etc.
The problem is that your row values are distributed across different threads.
For a single warp, the cleanest way to approach this would be for each thread to share its values via shared memory (in a 32x32 shared-memory array). After a thread sync, thread i sums the i-th row and writes the result to dest_i, which can reside in global or shared memory (depending on your application). A sketch of this is shown below.
This way the computation work (32 rows of 31 additions each) is divided evenly between the threads in the warp AND you don't need atomic ops (performance killers) at all.
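A possible sketch of that approach for a single-warp block (the kernel name and the placeholder value initialization are made up; in real code the 32 per-thread values would come from wherever the question's val_{t,i} come from):

__global__ void row_sums(int *dest)   // dest has 32 elements
{
    // Pad to 33 columns so the row-wise reads below are free of bank conflicts.
    __shared__ int tile[32][33];
    const int t = threadIdx.x;        // lane index 0..31 (launch one 32-thread block)

    // Placeholder for the question's val_{t,i}, assumed to live in registers.
    int val[32];
    for (int i = 0; i < 32; ++i)
        val[i] = t * 32 + i;

    // Each thread writes its 32 values into column t of the tile.
    for (int i = 0; i < 32; ++i)
        tile[i][t] = val[i];
    __syncthreads();

    // Thread t now owns row t, i.e. all val_{*,t}, and sums it with no atomics.
    int sum = 0;
    for (int i = 0; i < 32; ++i)
        sum += tile[t][i];
    dest[t] = sum;                    // dest_t = sum over all lanes t' of val_{t',t}
}

Launched as, say, row_sums<<<1, 32>>>(d_dest), this computes the same results as either atomic pattern in the question, but with ordinary shared-memory traffic.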
From my experience, atomic operations can and usually should be avoided by distributing the work differently between the threads.

CUDA memory operation order within a single thread

From the CUDA Programming Guide (v. 5.5):
The CUDA programming model assumes a device with a weakly-ordered memory model, that is:
The order in which a CUDA thread writes data to shared memory, global memory, page-locked host memory, or the memory of a peer device is not necessarily the order in which the data is observed being written by another CUDA or host thread;
The order in which a CUDA thread reads data from shared memory, global memory, page-locked host memory, or the memory of a peer device is not necessarily the order in which the read instructions appear in the program for instructions that are independent of each other.
However, do we have a guarantee that the (dependent) memory operations as seen from the single thread are actually consistent? If I do - say:
arr[x] = 1;
int z = arr[y];
where x happens to be equal to y, and no other thread is touching the memory, do I have a guarantee that z is 1? Or do I still need to put some volatile or a barrier between those two operations?
In response to Orpedo's answer.
If your compiler doesn't compile the functionality stated by your code into equal functionality in machine-code, the compiler is either broken or you haven't taken the optimizations into consideration...
My problem is what optimizations (done either by compiler or hardware) are allowed?
It could happen, for example, that the store instruction is non-blocking and the load instruction that follows is somehow serviced by the memory controller faster than the already queued-up store.
I don't know CUDA hardware. Do I have a guarantee that the above will never happen?
The CUDA Programming Guide is simply stating that you cannot predict the order in which the threads are executed, but every single thread will still run as a sequential thread.
In the example you state, where x and y are the same and NO OTHER THREAD is touching the memory, you DO have a guarantee that z = 1.
The point here is that if you have several threads doing operations on the same data (e.g. an array), you are NOT guaranteed that thread #9 executes before #10.
Take an example:
__device__ void sum_all(float *x, float *result, int N){
    // Each thread writes its own index into x, then sums the whole array.
    x[threadIdx.x] = threadIdx.x;
    result[threadIdx.x] = 0;
    for(int i = 0; i < N; i++)
        result[threadIdx.x] += x[i];
}
Here we have some dumb function, which SHOULD fill a shared array (x) with each thread's index, and then sum up the values already put into the array and store the result in another array.
Given that your lowest-indexed thread is enumerated thread #0, you would expect the first time this code runs that x should contain
x[] = {0, 0, 0 ... 0} and result[] = {0, 0, 0 ... 0}
next for thread #1
x[] = {0, 1, 0 ... 0} and result[] = {0, 1, 0 ... 0}
next for thread #2
x[] = {0, 1, 2 ... 0} and result[] = {0, 1, 3 ... 0}
and so forth.
But this is NOT guaranteed. You can't know if e.g. thread #3 runs first, hence changing the array x[] before thread #0 runs. You actually don't even know if the arrays are changed by some other thread while you are executing the code.
I am not sure if this is explicitly stated in the CUDA documentation (I wouldn't expect it to be), as this is a basic principle of computing. Basically, what you are asking is whether running your code on a GPU will change the functionality of your code.
The cores of a GPU are generally similar to those of a CPU, just with less control logic, a smaller instruction set, and typically only single-precision support.
In a CUDA GPU there is one program counter for each warp (a group of 32 cores running in lockstep). Like on a CPU, the program counter increases by one address element after each instruction, unless you have branches or jumps. This gives the sequential flow of the program, and this cannot be changed.
Branches and jumps can only be introduced by the software running on the core, and hence are determined by your compiler. Compiler optimizations can in fact change the functionality of your code, but only in the case where the code is implemented "wrong" with respect to the compiler.
So in short: your code will always be executed in the order it is laid out in memory, no matter whether it is executed on a CPU or a GPU. If your compiler doesn't compile the functionality stated by your code into equal functionality in machine-code, the compiler is either broken or you haven't taken the optimizations into consideration...
Hope this was clear enough :)
As far as I understand, you're basically asking whether memory dependencies and alias-analysis information are respected by the CUDA compiler.
The answer to that question is: assuming that the CUDA compiler is free of bugs, yes, because (as Robert noted) the CUDA compiler uses LLVM under the hood, and two basic passes (which, at the moment, I really don't think could be excluded from the pipeline) are:
Memory dependence analysis
Alias Analysis
These two passes detect memory locations potentially pointing to the same address and use liveness analysis on variables (even outside the block scope) to avoid dangerous optimizations (e.g. you can't write to a live variable before its next read; the data may still be needed).
I don't know the compiler internals, but assuming (as with any other reasonably trusted compiler) that it does its best to be bug-free, the analyses that take place in there should really not bother you at all and assure you that, at least in theory, what you just presented as an example (i.e. the dependent load being serviced faster than the store) cannot happen.
What guarantees you that? Nothing but the fact that the company is providing a compiler for you to use, with disclaimers for the exceptional cases where it doesn't hold :)
Also, aside from the compiler topic, instruction execution also depends on the hardware specification; in this case, a SIMT hardware instruction-issuing unit.
cf. http://www.csl.cornell.edu/~cbatten/pdfs/kim-simt-vstruct-isca2013.pdf and all the referenced papers for more information.

__threadfence() and L1 cache coherence

It is my understanding (see e.g. How can I enforce CUDA global memory coherence without declaring pointer as volatile?, CUDA block synchronization differences between GTS 250 and Fermi devices and this post in the nvidia Developer Zone) that __threadfence() guarantees that global writes will be visible to other threads before the thread continues. However, another thread could still read a stale value from its L1 cache even after the __threadfence() has returned.
That is:
Thread A writes some data to global memory, then calls __threadfence(). Then, at some time after __threadfence() has returned, and the writes are visible to all other threads, Thread B is asked to read from this memory location. It finds it has the data in L1, so loads that. Unfortunately for the developer, the data in Thread B's L1 is stale (i.e. it is as before Thread A updated this data).
First of all: is this correct?
Supposing it is, then it seems to me that __threadfence() is only useful if either one can be certain that data will not be in L1 (somewhat unlikely?) or if e.g. the read always bypasses L1 (e.g. volatile or atomics). Is this correct?
I ask because I have a relatively simple use-case - propagating data up a binary tree - using atomically-set flags and __threadfence(): the first thread to reach a node exits, and the second writes data to it based on its two children (e.g. the minimum of their data). This works for most nodes, but usually fails for at least one. Declaring the data volatile gives consistently correct results, but induces a performance hit for the 99%+ of cases where no stale value is grabbed from L1. I want to be sure this is the only solution for this algorithm. A simplified example is given below. Note that the node array is ordered breadth-first, with the leaves beginning at index start and already populated with data.
__global__ void propagate_data(volatile Node *nodes,
                               const unsigned int n_nodes,
                               const unsigned int start,
                               unsigned int* flags)
{
    int tid, index, left, right;
    float data;
    bool first_arrival;

    tid = start + threadIdx.x + blockIdx.x*blockDim.x;
    while (tid < n_nodes)
    {
        // We start at a node with a full data section; modify its flag
        // accordingly.
        flags[tid] = 2;

        // Immediately move up the tree.
        index = nodes[tid].parent;
        first_arrival = (atomicAdd(&flags[index], 1) == 0);

        // If we are the second thread to reach this node then process it.
        while (!first_arrival)
        {
            left = nodes[index].left;
            right = nodes[index].right;

            // If Node* nodes is not declared volatile, this occasionally
            // reads a stale value from L1.
            data = min(nodes[left].data, nodes[right].data);
            nodes[index].data = data;

            if (index == 0) {
                // Root node processed, so all nodes processed.
                return;
            }

            // Ensure above global write is visible to all device threads
            // before setting flag for the parent.
            __threadfence();

            index = nodes[index].parent;
            first_arrival = (atomicAdd(&flags[index], 1) == 0);
        }
        tid += blockDim.x*gridDim.x;
    }
    return;
}
First of all: is this correct?
Yes, __threadfence() pushes data into L2 and out to global memory. It has no effect on the L1 caches in other SMs.
Is this correct?
Yes, if you combine __threadfence() with volatile for global memory accesses, you should have confidence that values will eventually be visible to other threadblocks. Note, however, that synchronization between threadblocks is not a well-defined concept in CUDA. There are no explicit mechanisms to do so and no guarantee of the order of threadblock execution, so just because you have code with a __threadfence() somewhere operating on a volatile item, that still does not really guarantee what data another threadblock may pick up. That also depends on the order of execution.
If you use volatile, the L1 (if enabled -- current Kepler devices don't really have L1 enabled for general global access) should be bypassed. If you don't use volatile, then the L1 for the SM that is currently executing the __threadfence() operation should be consistent/coherent with L2 (and global) at the completion of the __threadfence() operation.
Note that the L2 cache is unified across the device and is therefore always "coherent". For your use case, at least from the device code perspective, there is no difference between L2 and global memory, regardless of which SM you are on.
And, as you indicate, (global) atomics always operate on L2/global memory.
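To make the pattern concrete, here is an illustrative sketch (not taken from the answer above; the variable names are made up) of combining a volatile flag with __threadfence(). The usual caveat applies: spinning on a flag written by another block assumes both blocks are actually resident at the same time.

__device__ volatile int data_ready = 0;
__device__ volatile float payload = 0.0f;

__device__ void publish(float value)
{
    payload = value;    // global write
    __threadfence();    // order the payload write before the flag becomes visible
    data_ready = 1;     // a reader that sees the flag will also see the payload
}

__device__ float consume(void)
{
    while (data_ready == 0)
        ;               // volatile read: re-reads memory each iteration, bypassing L1
    return payload;     // volatile read of the published value
}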

cuda, dummy/implicit block synchronization

I am aware that block sync is not possible, the only way is launching a new kernel.
BUT, let's suppose that I launch X blocks, where X corresponds to the number of SMs on my GPU. I should expect that the scheduler will assign a block to each SM... right? And if the GPU is being used as a secondary graphics card (completely dedicated to CUDA), this means that, theoretically, no other process uses it... right?
My idea is the following: implicit synchronization.
Let's suppose that sometimes I need only one block, and sometimes I need all the X blocks. Well, in those cases where I need just one block, I can configure my code so that the first block (or the first SM) works on the "real" data while the other X-1 blocks (or SMs) work on some "dummy" data, executing exactly the same instructions, just with some other offset.
So all of them will continue to be synchronized, until I need all of them again.
Is the scheduler reliable under these conditions? Or can you never be sure?
You've got several questions in one, so I'll try to address them separately.
One block per SM
I asked this a while back on nVidia's own forums, as I was getting results that indicated that this is not what happens. Apparently, the block scheduler will not assign a block per SM if the number of blocks is equal to the number of SMs.
Implicit synchronization
No. First of all, you cannot guarantee that each block will have its own SM (see above). Secondly, all blocks cannot access the global store at the same time. If they run synchronously at all, they will lose this synchronicity as of the first memory read/write.
Block synchronization
Now for the good news: Yes, you can. The atomic instructions described in Section B.11 of the CUDA C Programming Guide can be used to create a barrier. Assume that you have N blocks executing concurrently on your GPU.
__device__ int barrier = N;

__global__ void mykernel ( ) {

    /* Do whatever it is that this block does. */
    ...

    /* Make sure all threads in this block are actually here. */
    __syncthreads();

    /* Once we're done, decrease the value of the barrier. */
    if ( threadIdx.x == 0 )
        atomicSub( &barrier , 1 );

    /* Now wait for the barrier to be zero. */
    if ( threadIdx.x == 0 )
        while ( atomicCAS( &barrier , 0 , 0 ) != 0 );

    /* Make sure everybody has waited for the barrier. */
    __syncthreads();

    /* Carry on with whatever else you wanted to do. */
    ...

}
The instruction atomicSub(p,i) computes *p -= i atomically and is only called by the zeroth thread in the block, i.e. we only want to decrement barrier once. The instruction atomicCAS(p,c,v) sets *p = v iff *p == c and returns the old value of *p. This part just loops until barrier reaches 0, i.e. until all blocks have crossed it.
Note that you have to wrap this part in calls to __syncthreads(), as the threads in a block do not execute in strict lock-step and you have to force them all to wait for the zeroth thread.
Just remember that if you call your kernel more than once, you should set barrier back to N.
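For example, a host-side sketch of doing that (threadsPerBlock is an assumed launch parameter; barrier, mykernel and N are as in the code above):

int reset = N;                                       // one count per concurrent block
cudaMemcpyToSymbol(barrier, &reset, sizeof(int));    // set barrier back to N
mykernel<<< N, threadsPerBlock >>>();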
Update
In reply to jHackTheRipper's answer and Cicada's comment, I should have pointed out that you should not try to start more blocks than can be concurrently scheduled on the GPU! This is limited by a number of factors, and you should use the CUDA Occupancy Calculator to find the maximum number of blocks for your kernel and device.
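These days the same limit can also be queried at run time rather than read off the spreadsheet; a hedged host-side sketch using the CUDA occupancy API (threadsPerBlock is an assumed block size, mykernel as above):

int device = 0, numSMs = 0, blocksPerSM = 0;
int threadsPerBlock = 256;                            // example block size
cudaGetDevice(&device);
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, mykernel,
                                              threadsPerBlock, 0);
int maxConcurrentBlocks = blocksPerSM * numSMs;       // safe upper bound for N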
Judging by the original question, though, only as many blocks as there are SMs are being started, so this point is moot.
@Pedro is definitely wrong!
Achieving global synchronization has been the subject of several recent research works and, at least for non-Kepler architectures (I don't have one yet), the conclusion is always the same (or should be): it is not possible to achieve such a global synchronization across the whole GPU.
The reason is simple: CUDA blocks cannot be preempted, so given that you fully occupy the GPU, threads waiting for the barrier rendezvous will never allow their block to terminate. Thus, the block will not be removed from the SM, and it will prevent the remaining blocks from running.
As a consequence, you will just freeze the GPU, which will never be able to escape from this deadlock state.
-- edit to answer Pedro's remarks --
Such shortcomings have been noticed by other authors such as:
http://www.openclblog.com/2011/04/eureka.html
by the author of OpenCL in action
-- edit to answer Pedro's second remarks --
The same conclusion is made by @Jared Hoberock in this SO post:
Inter-block barrier on CUDA