What's the difference between safe, regular and atomic registers?

Can anyone provide an exhaustive explanation, please? I'm diving into concurrent programming and ran into these registers while trying to understand consensus.
From Lamport's "On Interprocess Communication": "...a regular register is atomic if two successive reads that overlap the same write cannot obtain the new then the old value...."
Assume that thread0.write(0) comes first, with no overlap. By Lamport's definition, on a register that is regular but not atomic, thread1 can first read 1 and then read 0 again if both reads are consecutive and overlap with thread0.write(1). But how is that possible?

Reads and writes to a shared memory location take a finite period of time, so they may either overlap, or be completely distinct.
e.g.
Thread 1:      wwwww     wwwww
Thread 2:   rrrrr              rrrrr
Thread 3:   rrrrr rrrrr
The first read from thread 2 overlaps with the first write from thread 1, whilst the second read and second write do not overlap. In thread 3, both reads overlap the first write.
A safe register is only safe as far as reads that do not overlap writes. If a read does not overlap any writes then it must read the value written by the most recent write. Otherwise it may return any value that the register may hold. So, in thread 2, the second read must return the value written by the second write, but the first read can return any valid value.
A regular register adds the additional guarantee that if a read overlaps with a write then it will either read the old value or the new one, but multiple reads that overlap the write do not have to agree on which, and the value may appear to "flicker" back and forth. This means that two reads from the same thread (such as in thread 3 above) that both overlap the write may appear "out of order": the earlier read returning the new value, and the later returning the old value.
An atomic register guarantees that reads and writes appear to happen at a single point in time. Reads that take effect before a write's point all return the old value, and reads that take effect after it all return the new value. In particular, if two reads from the same thread overlap a write, then the later read cannot return the old value if the earlier read returned the new one. Atomic registers are linearizable.
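To make the three guarantees concrete, here is a minimal Python sketch that enumerates which result pairs two successive reads by one thread may return when both overlap a single write. The function name allowed_read_pairs and the small value domain are assumptions made for illustration; they are not taken from Lamport's paper or from the book.

```python
from itertools import product

def allowed_read_pairs(kind, old, new, domain):
    """Pairs (first, second) that two successive reads by one thread may
    return when both reads overlap a single write changing old -> new."""
    if kind == "safe":
        # An overlapping read may return any value the register can hold.
        return set(product(domain, repeat=2))
    # Regular: each overlapping read independently returns old or new.
    pairs = set(product((old, new), repeat=2))
    if kind == "regular":
        return pairs
    if kind == "atomic":
        # Atomic: as regular, but "new then old" is forbidden.
        if old != new:
            pairs.discard((new, old))
        return pairs
    raise ValueError(kind)

domain = {0, 1, 2}
for kind in ("safe", "regular", "atomic"):
    print(kind, sorted(allowed_read_pairs(kind, old=0, new=1, domain=domain)))
```

With old=0 and new=1, the pair (1, 0) from the question is allowed for a regular register and excluded for an atomic one.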
The Art of Multiprocessor Programming by Maurice Herlihy and Nir Shavit gives a good description, along with examples and use cases.

performance differences between iterative and recursive method in mips

I have 2 versions of a program I must analyze. One is recursive and the other is iterative. I must compare the cache hit rate of both, examine how performance varies between the two methods, and compare their instruction counts.
Regardless of the block settings, I get roughly 100 fewer memory accesses for the iterative method. Both trigger 2 misses. I can only drop the cache hit rate to 85% if I configure the cache as 1 block of size 256.
As for the instruction count, the iterative version executes roughly 1000 fewer instructions.
Can someone explain why this happens, or point me to some literature on it? I can't seem to find anything. I would just like a general overview of why this occurs.
I took some info from this question: Recursion or Iteration?, and some from COMP 273 at McGill, which I assume you're also in and that's why you asked.
Each recursive call generally requires the return address of that call to be pushed onto the stack; in MIPS assembly this must be done manually, otherwise the return address in $ra gets overwritten by each jal. Each push and its matching pop is an extra store and load plus stack-pointer adjustments, so a recursive solution usually uses more cache space, makes more memory accesses, and executes more instructions. In an iterative solution none of this is necessary.
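As a rough illustration of where the extra accesses come from, the Python sketch below counts the stack stores and loads that a hand-written MIPS-style routine would perform. The factorial example, the function names, and the cost model (one store and one load each for $ra and for one saved argument per call) are assumptions for illustration only; the real counts depend on your program.

```python
def recursive_factorial(n, counter):
    # A MIPS routine would save $ra and its argument on the stack here.
    counter["stores"] += 2
    result = 1 if n <= 1 else n * recursive_factorial(n - 1, counter)
    # ... and restore both before returning with jr $ra.
    counter["loads"] += 2
    return result

def iterative_factorial(n, counter):
    # Everything can stay in registers, so no stack traffic is counted.
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result

rec = {"stores": 0, "loads": 0}
it = {"stores": 0, "loads": 0}
assert recursive_factorial(10, rec) == iterative_factorial(10, it)
print("recursive stack accesses:", rec["stores"] + rec["loads"])
print("iterative stack accesses:", it["stores"] + it["loads"])
```

Under this cost model the stack traffic grows linearly with the recursion depth, while the loop adds none.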

Kernel design for overlapping data, launch of a separate warp

I have a question regarding a CFD application I am trying to implement according to a paper I found online. This might be somewhat of a beginner question, but here it goes.
The situation is as follows:
The 2D domain gets decomposed into tiles. Each tile is processed by one block of the kernel in question. The calculations are highly suited for parallel execution, as they take into account only a handful of neighbours (it's a shallow water application). The tiles do overlap: each tile has 2 extra cells on each side of the domain it is supposed to compute results for.
On the left you see 1 block, on the right 4, with the overlap that comes with it. Grey cells are "ghost cells" needed for the calculation. Light green is the domain each block actually writes back to global memory. Needless to say, the whole domain is going to have more than 4 tiles.
The idea per thread is as follows:
(1) copy data from global memory to shared memory
__syncthreads();
(2) perform some calculations
__syncthreads();
(3) perform some more calculations
(4) write back to global memory
For the cells in the green area, the kernel is straightforward: you copy data according to your thread index and calculate using your neighbours in shared memory. Because of the nature of the data dependency, however, this does not suffice:
(1) has to be run on all cells (grey and green). No dependency.
(2) has to be run on all green cells, and the inner rows/columns of the grey cells. Depends on neighbouring data N,S,E and W.
(3) has to be run on all green cells. Depends on data from step (2) on neighbours N,S,E and W.
So here goes my question:
How does one do this without terribly cluttered code?
All I can think of is a horrible number of "if" statements to decide whether a thread should perform some of these steps twice, depending on its thread index.
I have considered using overlapping blocks as well (as opposed to just overlapping data), but this leads to another problem: the __syncthreads() calls would have to be in conditional parts of the code.
Taking the kernel apart and running steps (2) and (3) in different kernels is not really an option either, as they produce intermediate results which can't all be written back to memory because of their number and size.
The author himself writes this (Brodtkorb et al. 2010, Efficient Shallow Water Simulations on GPUs: Implementation, Visualization, Verification, and Validation):
"When launching our kernel, we start by reading from global memory into on-chip shared memory. In addition to the interior cells of our block, we need to use data from two neighbouring cells in each direction to fulfill the data dependencies of the stencil. After having read data into shared memory, we proceed by computing the one dimensional fluxes in the x and y directions, respectively. Using the steps illustrated in Figure 1, fluxes are computed by storing all values that are used by more than one thread in shared memory. We also perform calculations collectively within one block to avoid duplicate computations. However, because we compute the net contribution for each cell, we have to perform more reconstructions and flux calculations than the number of threads, complicating our kernel. This is solved in our code by designating a single warp that performs the additional computations; a strategy that yielded a better performance than dividing the additional computations between several warps."
So, what does he mean by designating a single warp to do these computations, and how does one do so?
You could do something like this:
// work that is done by all threads in a block
__syncthreads(); // may or may not be needed
// assumes a 1D block; for a 2D block, test the linear thread index instead
if (threadIdx.x < 32) {
    // work that is done only by the designated single warp
}
Although that's trivially simple, as far as your last question and the highlighted paragraph are concerned, I think it's very likely what they are referring to. I think it fits with what I'm reading here. Furthermore, I don't know of any other way to restrict work to a single warp except by using conditionals. They may also have chosen a single warp to take advantage of warp-synchronous behavior, which gets around the issue you mention earlier of __syncthreads() in conditional code. (On GPUs from Volta onward, implicit warp-synchronous execution is no longer guaranteed, so code relying on it needs explicit __syncwarp() calls.)
You also ask how to do this without terribly cluttered code, when all you can think of is a horrible number of "if" statements deciding whether a thread should perform some of these steps twice, depending on its thread index.
Actually, I don't think any sequence of ordinary "if" statements, regardless of how cluttered, could solve the problem you describe.
A typical way to solve the dependency between steps 2 and 3 that you have already mentioned is to separate the work into two (or more) kernels. You indicate that this is "not really an option", but as near as I can tell, what you're looking for is a global sync. Such a concept is not well-defined in CUDA except for the kernel launch/exit points. CUDA does not guarantee execution order among blocks in a grid. If your block calculations in step 3 depend on neighboring block calculations in step 2, then in my opinion, you definitely need a global sync, and your code is going to get ugly if you don't implement it with a kernel launch. Alternative methods such as using global semaphores or global block counters are, in my opinion, fragile and difficult to apply to general cases of widespread data dependence (where every block is dependent on neighbor calculations from the previous step).
If the neighboring calculations depend only on the data from a thin set of neighboring cells (a "halo"), and not on the whole neighboring block, and those cells can be computed independently, then it might be possible to expand your block to include the neighboring cells (i.e. overlap), effectively computing the halo regions twice between neighboring blocks. You've indicated that you've already considered and discarded this idea. However, I personally would want to consider the code in detail before accepting that it should be rejected based entirely on difficulty with __syncthreads(). In my experience, people who say they can't use __syncthreads() due to conditional code execution haven't accurately considered all the options, at a detailed code level, for making __syncthreads() work even in the midst of conditional code.

Is it possible to atomically read a double word with single-word operations?

I remember that at some point, I saw a programming challenge where people were asked to atomically read a double word with single-word operations.
Now, before we go on, a few clarifications:
A single word is a unit that has the natural number of bits that can be processed by some processor (for instance, 32 bits for 32-bit processors);
A double word is a unit that is twice the size of a single word (and therefore can't be treated at once by processor instructions);
By atomic, I mean that the result data should be consistent with what the value was at some point, and we can assume that no implementation or hardware detail will get in the way.
As I recall it, the solution was to read the highest word, then read the lowest word. Then read again the highest word, and if it hasn't changed, then the value can be deemed consistent.
However, I'm not sure anymore. Suppose that we have the two digits 01.
The algorithm would then read the high part (and get 0).
Now, another actor changes the value to 22 before our algorithm reads the next part.
We then read the low digit, 2;
but then, the pesky troublemaker changes the value again, and makes it 03.
We read the high part, and it's 0. We read 02, and the algorithm deems the value consistent; but in fact, it never was 02.
Is my recollection of the conundrum correct? Is it the solution that I don't recall correctly?
That solution sounds like it was meant for a system where the value was constantly incrementing or decrementing, rather than arbitrarily changing.
By reading high/low/high in a system where the value is incrementing, you can be sure that the value hasn't wrapped, such as (for a one-digit word) 0,9 becoming 1,0. The code would be something like:
read high into reg2                  # Get high value.
set reg0 to reg2 minus 1             # Force entry into loop.
while reg0 is not equal to reg2:     # Repeat until consecutive highs are same.
    set reg0 to reg2                 # Transfer high.
    read low into reg1               # Read low.
    read high into reg2              # Read next high.
# Now reg0/reg1 contains high/low.
In any other situation where the value can arbitrarily change, you need some sort of test-and-set operation on a separate word, effectively implementing a low-level mutex to protect the double-word.
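Both cases can be replayed in a small single-threaded Python simulation. This is only a sketch: read_double_word and scripted_writer are hypothetical names, the "other actor" is modelled as a callback that fires after each single-word read, and each scripted update changes both words at once, which is a simplification of what real hardware would do.

```python
def read_double_word(mem, on_read=lambda: None):
    """High/low/high read loop. on_read is called after every single-word
    read so that writes from "another actor" can be interleaved there."""
    def read(word):
        value = mem[word]
        on_read()
        return value

    high = read("high")
    while True:
        low = read("low")
        high_again = read("high")
        if high_again == high:      # consecutive highs agree: accept
            return high, low
        high = high_again           # otherwise retry with the newer high

def scripted_writer(mem, updates):
    """Return a callback that applies the next (high, low) update, if any."""
    pending = iter(updates)
    def step():
        nxt = next(pending, None)
        if nxt is not None:
            mem["high"], mem["low"] = nxt
    return step

# Arbitrary changes: 01 -> 22 -> 03, as in the question.
mem = {"high": 0, "low": 1}
torn = read_double_word(mem, scripted_writer(mem, [(2, 2), (0, 3)]))
print("arbitrary writer:", torn)      # (0, 2), a value that never existed

# Incrementing counter with one-digit words: 09 -> 10.
mem = {"high": 0, "low": 9}
counted = read_double_word(mem, scripted_writer(mem, [(1, 0)]))
print("incrementing writer:", counted)  # (1, 0)
```

In the second run the first attempt would have produced 00, but the changed high word forces a retry.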

What is differential execution?

I stumbled upon a Stack Overflow question, How does differential execution work?, which has a VERY long and detailed answer. All of it made sense... but when I was done I still had no idea what the heck differential execution actually is. What is it really?
REVISED. This is my Nth attempt to explain it.
Suppose you have a simple deterministic procedure that executes repeatedly, always following the same sequence of statement executions or procedure calls.
The procedure calls themselves write anything they want sequentially to a FIFO, and they read the same number of bytes from the other end of the FIFO, like this:**
The procedures being called are using the FIFO as memory, because what they read is the same as what they wrote on the prior execution.
So if their arguments happen to be different this time from last time, they can see that, and do anything they want with that information.
To get it started, there has to be an initial execution in which only writing happens, no reading.
Symmetrically, there should be a final execution in which only reading happens, no writing.
So there is a "global" mode register containing two bits, one that enables reading and one that enables writing, like this:
The initial execution is done in mode 01, so only writing is done.
The procedure calls can see the mode, so they know there is no prior history.
If they want to create objects they can, and put the identifying information in the FIFO (no need to store in variables).
The intermediate executions are done in mode 11, so both reading and writing happen, and the procedure calls can detect data changes.
If there are objects to be kept up to date, their identifying information is read from and written to the FIFO, so they can be accessed and, if necessary, modified.
The final execution is done in mode 10, so only reading happens.
In that mode, the procedure calls know they are just cleaning up.
If there were any objects being maintained, their identifiers are read from the FIFO, and they can be deleted.
But real procedures do not always follow the same sequence.
They contain IF statements (and other ways of varying what they do).
How can that be handled?
The answer is a special kind of IF statement (and its terminating ENDIF statement).
Here's how it works.
It writes the boolean value of its test expression, and it reads the value that the test expression had last time.
That way, it can tell if the test expression has changed, and take action.
The action it takes is to temporarily alter the mode register.
Specifically, x is the prior value of the test expression, read from the FIFO (if reading is enabled, else 0), and y is the current value of the test expression, written to the FIFO (if writing is enabled).
(Actually, if writing is not enabled, the test expression is not even evaluated, and y is 0.)
Then x,y simply MASKs the mode register r,w.
So if the test expression has changed from True to False, the body is executed in read-only mode. Conversely if it has changed from False to True, the body is executed in write-only mode.
If the result is 00, the code inside the IF..ENDIF statement is skipped.
(You might want to think a bit about whether this covers all cases - it does.)
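The masking rule can be written out as a few lines of Python and checked against every case. This is a sketch under assumptions: the function name if_body_mode and its argument list are made up for illustration, and the mode is represented as a pair (r, w) of 0/1 bits, so 01 is (0, 1).

```python
def if_body_mode(r, w, old_test, test_fn):
    """Mode (r, w) in which the body of a differential IF runs.

    r, w     -- read/write enable bits of the enclosing mode
    old_test -- value the test had on the prior pass (from the FIFO)
    test_fn  -- callable evaluating the test now; not called when w == 0
    """
    x = int(bool(old_test)) if r else 0
    y = int(bool(test_fn())) if w else 0
    return (r & x, w & y)

# Print the resulting body mode for every enclosing mode and test history.
for r, w in [(0, 1), (1, 1), (1, 0)]:
    for old in (0, 1):
        for new in (0, 1):
            print((r, w), old, new, "->", if_body_mode(r, w, old, lambda: new))
```

A result of (0, 0) means the body is skipped; (1, 0) and (0, 1) are the read-only and write-only cases described above.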
It may not be obvious, but these IF..ENDIF statements can be arbitrarily nested, and they can be extended to all other kinds of conditional statements like ELSE, SWITCH, WHILE, FOR, and even calling pointer-based functions. It is also the case that the procedure can be divided into sub-procedures to any extent desired, including recursive, as long as the mode is obeyed.
(There is a rule that must be followed, called the erase-mode rule, which is that in mode 10 no computation of any consequence, such as following a pointer or indexing an array, should be done. Conceptually, the reason is that mode 10 exists only for the purpose of getting rid of stuff.)
So it is an interesting control structure that can be exploited to detect changes, typically data changes, and take action on those changes.
Its use in graphical user interfaces is to keep some set of controls or other objects in agreement with program state information. For that use, the three modes are called SHOW(01), UPDATE(11), and ERASE(10).
The procedure is initially executed in SHOW mode, in which controls are created, and information relevant to them populates the FIFO.
Then any number of executions are made in UPDATE mode, where the controls are modified as necessary to stay up to date with program state.
Finally, there is an execution in ERASE mode, in which the controls are removed from the UI, and the FIFO is emptied.
The benefit of doing this is that, once you've written the procedure to create all the controls, as a function of the program's state, you don't have to write anything else to keep it updated or clean up afterward.
Anything you don't have to write means less opportunity to make mistakes.
(There is a straightforward way to handle user input events without having to write event handlers and create names for them. This is explained in one of the videos linked below.)
In terms of memory management, you don't have to make up variable names or data structure to hold the controls. It only uses enough storage at any one time for the currently visible controls, while the potentially visible controls can be unlimited. Also, there is never any concern about garbage collection of previously used controls - the FIFO acts as an automatic garbage collector.
In terms of performance, when it is creating, deleting, or modifying controls, it is spending time that needs to be spent anyway.
When it is simply updating controls, and there is no change, the cycles needed to do the reading, writing, and comparison are microscopic compared to altering controls.
Another performance and correctness consideration, relative to systems that update displays in response to events, is that such a system requires that every event be responded to, and none twice, otherwise the display will be incorrect, even though some event sequences may be self-canceling. Under differential execution, update passes may be performed as often or as seldom as desired, and the display is always correct at the end of a pass.
Here is an extremely abbreviated example where there are 4 buttons, of which buttons 2 and 3 are conditional on a boolean variable.
In the first pass, in Show mode, the boolean is false, so only buttons 1 and 4 appear.
Then the boolean is set to true and pass 2 is performed in Update mode, in which buttons 2 and 3 are instantiated and button 4 is moved, giving the same result as if the boolean had been true on the first pass.
Then the boolean is set false and pass 3 is performed in Update mode, causing buttons 2 and 3 to be removed and button 4 to move back up to where it was before.
Finally pass 4 is done in Erase mode, causing everything to disappear.
(In this example, the changes are undone in the reverse order as they were done, but that is not necessary. Changes can be made and unmade in any order.)
Note that, at all times, the FIFO, consisting of Old and New concatenated together, contains exactly the parameters of the visible buttons plus the boolean value.
The point of this is to show how a single "paint" procedure can also be used, without change, for arbitrary automatic incremental updating and erasing.
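The four passes of the button example can be replayed with a toy engine. This is a minimal sketch under assumptions, not the author's implementation: the class DiffExec, its methods, and the dict standing in for the real UI are all hypothetical, and button positions (such as button 4 moving) are not modelled.

```python
class DiffExec:
    """Toy differential-execution engine; a dict stands in for the UI."""

    def __init__(self):
        self.old = []        # FIFO contents written on the previous pass
        self.new = []        # FIFO contents being written on this pass
        self.pos = 0         # read position in self.old
        self.r = self.w = 0  # mode bits: read enabled, write enabled
        self.visible = {}    # widget id -> label
        self.next_id = 0

    def run(self, mode, paint):
        self.r, self.w = mode
        self.new, self.pos = [], 0
        paint(self)
        self.old = self.new  # this pass's output is the next pass's input

    def _read(self):
        value = self.old[self.pos]
        self.pos += 1
        return value

    def button(self, label):
        if self.r:
            ident, old_label = self._read()
        if self.w:
            if not self.r:                 # write-only: create the widget
                ident, self.next_id = self.next_id, self.next_id + 1
                self.visible[ident] = label
            elif old_label != label:       # read+write: update on change
                self.visible[ident] = label
            self.new.append((ident, label))
        elif self.r:                       # read-only: remove the widget
            del self.visible[ident]

    def if_(self, test_fn, body):
        x = self._read() if self.r else 0
        y = int(bool(test_fn())) if self.w else 0
        if self.w:
            self.new.append(y)
        saved = (self.r, self.w)
        self.r, self.w = self.r & x, self.w & y   # mask the mode
        if self.r or self.w:
            body(self)
        self.r, self.w = saved

state = {"flag": False}

def conditional_buttons(de):
    de.button("2")
    de.button("3")

def paint(de):
    de.button("1")
    de.if_(lambda: state["flag"], conditional_buttons)
    de.button("4")

SHOW, UPDATE, ERASE = (0, 1), (1, 1), (1, 0)
de = DiffExec()
snapshots = []
for mode, flag in [(SHOW, False), (UPDATE, True), (UPDATE, False), (ERASE, False)]:
    state["flag"] = flag
    de.run(mode, paint)
    snapshots.append(sorted(de.visible.values()))
print(snapshots)  # [['1', '4'], ['1', '2', '3', '4'], ['1', '4'], []]
```

The same paint function is run unchanged on every pass; only the mode and the program state differ.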
I hope it is clear that it works for arbitrary depth of sub-procedure calls, and arbitrary nesting of conditionals, including switch, while and for loops, calling pointer-based functions, etc.
If I have to explain that, then I'm open to potshots for making the explanation too complicated.
Finally, there are a couple of crude but short videos posted here.
** Technically, they have to read the same number of bytes they wrote last time. So, for example, they might have written a string preceded by a character count, and that's OK.
ADDED: It took me a long time to be sure this would always work.
I finally proved it.
It is based on a Sync property, roughly meaning that at any point in the program the number of bytes written on the prior pass equals the number read on the subsequent pass.
The idea behind the proof is to do it by induction on program length.
The toughest case to prove is the case of a section of program consisting of s1 followed by an IF(test) s2 ENDIF, where s1 and s2 are subsections of the program, each satisfying the Sync property.
To do it in text-only is eye-glazing, but here I've tried to diagram it:
It defines the Sync property, and shows the number of bytes written and read at each point in the code, and shows that they are equal.
The key points are that 1) the value of the test expression (0 or 1) read on the current pass must equal the value written on the prior pass, and 2) the condition of Sync(s2) is satisfied.
This satisfies the Sync property for the combined program.
I read all the stuff I can find and watched the video and will take a shot at a first-principles description.
Overview
This is a DSL-based design pattern for implementing user interfaces and perhaps other state-oriented subsystems in a clean, efficient manner. It focuses on the problem of changing the GUI configuration to match current program state, where that state includes the condition of GUI widgets themselves, e.g. the user selects tabs, radio buttons, and menu items, and widgets appear/disappear in arbitrarily complex ways.
Description
The pattern assumes:
A global collection C of objects that needs periodic updates.
A family of types for those objects, where instances have parameters.
A set of operations on C:
Add A P - Put a new object A into C with parameters P.
Modify A P - Change the parameters of object A in C to P.
Delete A - Remove object A from C.
An update of C consists of a sequence of such operations to transform C to a given target collection, say C'.
Given current collection C and target C', the goal is to find the update with minimum cost. Each operation has unit cost.
The set of possible collections is described in a domain-specific language (DSL) that has the following commands:
Create A H - Instantiate some object A, using optional hints H, and add it to the global state. (Note no parameters here.)
If B Then T Else F - Conditionally execute command sequence T or F based on Boolean function B, which can depend on anything in the running program.
In all the examples,
The global state is a GUI screen or window.
The objects are UI widgets. Types are button, dropdown box, text field, ...
Parameters control widget appearance and behavior.
Each update consists of adding, deleting, and modifying (e.g. relocating) any number of widgets in the GUI.
The Create commands are making widgets: buttons, dropdown boxes, ...
The Boolean functions depend on the underlying program state including the condition of GUI controls themselves. So changing a control can affect the screen.
Missing links
The inventor never explicitly states it, but a key idea is that we run the DSL interpreter over the program that represents all possible target collections (screens) every time we expect any combination of the Boolean function values B has changed. The interpreter handles the dirty work of making the collection (screen) consistent with the new B values by emitting a sequence of Add, Delete, and Modify operations.
There is a final hidden assumption: The DSL interpreter includes some algorithm that can provide the parameters for the Add and Modify operations based on the history of Creates executed so far during its current run. In the GUI context, this is the layout algorithm, and the Create hints are layout hints.
Punch line
The power of the technique lies in the way complexity is encapsulated in the DSL interpreter. A stupid interpreter would start by Deleting all the objects (widgets) in the collection (screen), then Add a new one for each Create command as it sees them while stepping through the DSL program. Modify would never occur.
Differential execution is just a smarter strategy for the interpreter. It amounts to keeping a serialized recording of the interpreter's last execution. This makes sense because the recording captures what's currently on the screen. During the current run, the interpreter consults the recording to make decisions about how to bring about the target collection (widget configuration) with operations having least cost. This boils down to never Deleting an object (widget) only to Add it again later for a cost of 2. DE will always Modify instead, which has a cost of 1. If we happen to run the interpreter in some case where the B values have not changed, the DE algorithm will generate no operations at all: the recorded stream already represents the target.
As the interpreter executes commands, it is also setting up the recording for its next run.
An analogous algorithm
The algorithm has the same flavor as minimum edit distance (MED). However DE is a simpler problem than MED because there are no "repeated characters" in the DE serialized execution strings as there are in MED. This means we can find an optimal solution with a straightforward on-line greedy algorithm rather than dynamic programming. That's what the inventor's algorithm does.
Strengths
My take is that this is a good pattern for implementing systems with many complex forms where you want total control over placement of widgets with your own layout algorithm and/or the "if else" logic of what's visible is deeply nested. If there are K nests of "if elses" N deep in the form logic, then there are K*2^N different layouts to get right. Traditional form design systems (at least the ones I've used) don't support larger K, N values very well at all. You tend to end up with large numbers of similar layouts and ad hoc logic to select them that's ugly and hard to maintain. This DSL pattern seems a way to avoid all that. In systems with enough forms to offset the DSL interpreter's cost, it would even be cheaper during initial implementation. Separation of concerns is also a strength. The DSL programs abstract the content of forms while the interpreter is the layout strategy, acting on hints from the DSL. Getting the DSL and layout hint design right seems like a significant and cool problem in itself.
Questionable...
I'm not sure that avoiding Delete/Add pairs in favor of Modify is worth all the trouble in modern systems. The inventor seems most proud of this optimization, but the more important idea is a concise DSL with conditionals to represent forms, with layout complexity isolated in the DSL interpreter.
Recap
The inventor has so far focused on deep details of how the interpreter makes its decisions. This is confusing because it's directed at the trees while the forest is of greater interest. This is a description of the forest.

cache behaviour on redundant writes

Edit - I guess the question I asked was too long so I'm making it very specific.
Question: Suppose a memory location is in the L1 cache, is not marked dirty, and holds a value X. What happens if you try to write X to the same location? Is there any CPU that would see that such a write is redundant and skip it?
For example, is there an optimization which compares the two values and discards a redundant write back to main memory? Specifically, how do mainstream processors handle this? What about when the value is a special value like 0? If there's no such optimization even for a special value like 0, is there a reason?
Motivation: We have a buffer that can easily fit in the cache. Multiple threads could potentially use it by recycling it amongst themselves. Each use involves writing to n locations (not necessarily contiguous) in the buffer. Recycling simply means setting all values to 0. Each time we recycle, size - n locations are already 0. Intuitively, it seems to me that avoiding so many redundant write-backs would make the recycling process faster, hence the question.
Doing this check in code wouldn't make sense, since the branch instruction itself might cause an unnecessary cache miss (if (buf[i]) {...}).
I am not aware of any processor that does the optimization you describe - eliminating writes to clean cache lines that would not change the value - but it's a good question, a good idea, great minds think alike and all that.
I wrote a great big reply, and then I remembered: this is called "Silent Stores" in the literature. See "Silent Stores for Free", K. Lepak and M. Lipasti, UWisc, MICRO-33, 2000.
Anyway, in my reply I described some of the implementation issues.
By the way, topics like this are often discussed in the Usenet newsgroup comp.arch.
I also write about them on my wiki, http://comp-arch.net
Your suggested hardware optimization would not reduce the latency. Consider the operations at the lowest level:
1. The old value at the location is loaded from the cache to the CPU (assuming it is already in the cache).
2. The old and new values are compared.
3. If the old and new values are different, the new value is written to the cache. Otherwise it is ignored.
Step 1 may actually take longer than steps 2 and 3, because steps 2 and 3 cannot start until the old value from step 1 has been brought into the CPU. The situation would be the same if it were implemented in software.
Consider instead simply writing the new values to the cache without checking the old value. This is actually faster than the three-step process above, for two reasons. First, there is no need to wait for the old value. Second, the CPU can simply queue the write in an output buffer, which performs the cache write while the ALU starts working on something else.
So far, the only latencies involved are that of between the CPU and the cache, not between the cache and the main memory.
The situation is more complicated in modern-day microprocessors, because their cache is organized into cache-lines. When a byte value is written to a cache-line, the complete cache-line has to be loaded because the other part of the cache-line that is not rewritten has to keep its old values.
http://blogs.amd.com/developer/tag/sse4a/
Read
Cache hit: Data is read from the cache line to the target register
Cache miss: Data is moved from memory to the cache, and read into the target register
Write
Cache hit: Data is moved from the register to the cache line
Cache miss: The cache line is fetched into the cache, and the data from the register is moved to the cache line
This is not an answer to your original question on computer-architecture, but might be relevant to your goal.
In this discussion, all array indices start at zero.
Assuming n is much smaller than size, change your algorithm so that it keeps two pieces of information:
The full-size array, of length size.
An array of length n plus a counter, used to emulate a set container. Duplicate values are allowed.
Every time a non-zero value is written to index k in the full-size array, insert the value k into the set container.
When the full-size array needs to be cleared, take each value stored in the set container (which will contain k, among others) and set the corresponding index in the full-size array to zero.
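The scheme above can be sketched in a few lines of Python. The class name TrackedBuffer and its methods are hypothetical, and the sketch assumes at most n non-zero writes happen between two clears (otherwise the index array overflows).

```python
class TrackedBuffer:
    """Full-size array plus a list of touched indices for cheap clearing."""

    def __init__(self, size, n):
        self.data = [0] * size
        self.touched = [0] * n   # emulated set container; duplicates allowed
        self.count = 0           # number of valid entries in self.touched

    def write(self, k, value):
        if value != 0:
            # Assumes fewer than n non-zero writes since the last clear.
            self.touched[self.count] = k
            self.count += 1
        self.data[k] = value

    def clear(self):
        # Zero only the locations that were actually written.
        for i in range(self.count):
            self.data[self.touched[i]] = 0
        self.count = 0

buf = TrackedBuffer(size=1000, n=8)
for k in (3, 500, 3, 999):
    buf.write(k, 42)
buf.clear()
print(any(buf.data), buf.count)
```

Clearing now costs one write per recorded index instead of one write per element of the full-size array.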
A similar technique, known as a two-level histogram or radix histogram, can also be used.
Two pieces of information are stored:
The full-size array, of length size.
A boolean array of length ceil(size / M), where M is the radix and ceil is the ceiling function.
Every time a non-zero value is written to index k in the full-size array, element floor(k / M) of the boolean array is marked.
Say bool_array[j] is marked. This corresponds to the range from j*M to (j+1)*M-1 in the full-size array.
When the full-size array needs to be cleared, scan the boolean array for marked elements and clear the corresponding range of the full-size array for each one.
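A minimal Python sketch of this two-level scheme follows. The class name BlockMarkedBuffer is an assumption for illustration, and the last range is clipped to size in case size is not a multiple of M.

```python
class BlockMarkedBuffer:
    """Full-size array plus one dirty flag per range of M elements."""

    def __init__(self, size, M):
        self.size, self.M = size, M
        self.data = [0] * size
        self.marked = [False] * ((size + M - 1) // M)  # ceil(size / M) flags

    def write(self, k, value):
        if value != 0:
            self.marked[k // self.M] = True            # floor(k / M)
        self.data[k] = value

    def clear(self):
        for j, is_marked in enumerate(self.marked):
            if is_marked:
                lo = j * self.M
                hi = min(lo + self.M, self.size)       # clip the last range
                self.data[lo:hi] = [0] * (hi - lo)
                self.marked[j] = False

buf2 = BlockMarkedBuffer(size=1000, M=64)
for k in (3, 500, 999):
    buf2.write(k, 42)
dirty_before = sum(buf2.marked)
buf2.clear()
print(dirty_before, any(buf2.data))
```

Compared with recording individual indices, this needs no bound on the number of writes, at the cost of clearing whole ranges of M elements.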