synchronizes-with, happens-before relation and acquire-release semantics - language-agnostic

I need help in understanding the synchronizes-with relation. The more I read about it and try to understand the examples, the more I feel that I understand nothing. Sometimes I feel that I've finally got it, but after looking at another example I get confused again. Please help me get it right.
It's said that an operation A synchronizes-with an operation B if A is a store to some atomic variable m, with release semantics, B is a load from the same variable m, with acquire semantics, and B reads the value stored by A.
It's also said that an operation A happens-before an operation B if
A is performed on the same thread as B, and A is before B in program order, or
A synchronizes-with B, or
A happens-before some other operation C, and C happens-before B
OK. If we look at this example
thread0 performs | thread1 performs
store x (release) | load x (acquire)
does the store to x here synchronize-with the load from x? If we do have a synchronizes-with relationship here, then the store to x happens-before the load from x, so everything that is sequenced before the store to x in thread 0 happens-before the load from x in thread 1. That means ordering is enforced here. Is that right? But in that case I don't understand what the "and B reads the value stored by A" part of the definition means. If thread 1 is faster than thread 0, it may read the old value. So what is the relationship here, and is there any relationship at all? If there is none, how can I establish that relationship?
Thanks in advance.

I cannot say that I am well familiar with the terminology, but this is how I think it goes. I will use the .NET definitions for terms: "An operation has acquire semantics if other processors will always see its effect before any subsequent operation's effect. An operation has release semantics if other processors will see every preceding operation's effect before the effect of the operation itself."
There is no enforced ordering between the store and the load in that example. Either one may be executed first. A synchronizes-with B only when the store happens to be executed before the load. When that happens, all operations before the store (release) are guaranteed to be executed before any operations after the load (acquire).
However, the load may be executed before the store. In that case A does not synchronize-with B (the condition "and B reads the value stored by A" does not hold), and so the operations after the load may be executed before the operations that precede the store. The order is undefined.
What the release semantics do guarantee is that once the second thread loads the value stored by the release, the operations preceding the store have already been executed (and therefore happen before the operations following the load).
Let's assume that count and flag are initialized to zero and both threads are run in parallel:
thread0:
st count, 1 (A)
st.release flag, 1 (B)
thread1:
ld.acquire flag (C)
ld count (D)
We know that A happens-before B and C happens-before D, because their order is forced by the release and acquire semantics. The order of B and C is undefined. It only becomes defined when B synchronizes-with C, and then we know that A happens-before D (since A happens-before B happens-before C happens-before D).
In thread1, count is always 1 if flag is 1. If flag is 0, count may be either 0 or 1. We can test the flag to determine whether the other thread has set count.
Without acquire and release semantics the loads and stores could be reordered, and both flag and count could be either 0 or 1, with no dependency between them.
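To make this concrete, here is a rough C++11 translation of the same example (my own sketch, not part of the original pseudo-assembly). Note that count is a plain int, so the sketch only reads it after the acquire load has observed the flag; reading it unconditionally could race with the store in thread0.

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> flag{0};
int count = 0;                                     // plain, non-atomic data

void thread0() {
    count = 1;                                     // A
    flag.store(1, std::memory_order_release);      // B
}

void thread1() {
    if (flag.load(std::memory_order_acquire) == 1) {  // C: reads the value stored by B
        int c = count;                                 // D: A happens-before D, so c == 1
        assert(c == 1);
    }
    // If the acquire load returned 0, B did not synchronize-with C and we learn nothing.
}

int main() {
    std::thread t0(thread0), t1(thread1);
    t0.join();
    t1.join();
}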
If we want to ensure that B happens-before C, we can use semaphores or some other wait-and-signal mechanism. In the previous example we could enforce the order by busy-waiting for the flag to be set:
thread1:
ld.acquire flag (C)
repeat C while flag == 0
ld count (D)
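In the same C++ sketch as above (still assuming the flag and count variables from the previous snippet), the busy-waiting version of thread1 would look roughly like this:

void thread1_waiting() {
    // Spin until the acquire load observes the release store. At that point
    // B synchronizes-with C, so A happens-before D and count must be 1.
    while (flag.load(std::memory_order_acquire) == 0) {
        // busy-wait
    }
    int c = count;                                 // D
    assert(c == 1);
}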

Related

Is the flow of data inside the data flow tasks synchronous?

In Data flow task suppose I have source, couple of transforms and destination.
Say there are 1 million records to be read by the source, and say it has reached row number 10,000. Will the rows read so far (10,000) get passed to the next transform, or will the subsequent tasks wait for the previous task to completely process the rows, so that a transform only runs once all 1 million have been read?
It depends!
Quick definitions:
Synchronous: One row in, one row out. Input lineage id is the same as output lineage id
Asynchronous: N row(s) in, M row(s) out.
Async, Fully blocking: All data must arrive before new data can leave
Async, Partially blocking: Enough data must arrive before new data can leave
All synchronous
OLE DB Source -> Derived Column -> OLE DB Destination
All synchronous components. 1M source rows; 10k rows flow from the source, to the derived column, to the destination. Lather, rinse, repeat.
Asynchronous, fully blocking
OLE DB Source ->
Aggregate -> OLE DB Destination
Aggregate is an asynchronous, fully blocking component. 1M source rows; 10k rows flow from the source to the Aggregate (let's assume we're getting the maximum sales amount grouped by sales id). It computes the maximum amount for the 10k rows it has, but it can't release them downstream because row 10k+1 might have a larger value, so it holds and stores the values until it has received an end-of-buffer signal from the source.
Only then, can the Aggregate component release the results to the downstream consumers of data.
I show the Aggregate not being "in line" with the source because at this point in a data flow there is a rift between the data before it and the data after it. If it had been Source -> Derived Column -> Aggregate, the Derived Column component would work on the same memory address (yay pointers) as the Source. Asynchronous components make an in-memory copy of the data into a separate memory space. So, instead of being able to allocate 1 GB to your data flow, it has to spend 0.5 GB on the first half and 0.5 GB on the second half.
If you rename a column "upstream" of an asynchronous component, you can tell where the "break" in data lineage is, because that column won't be available to the final destination until you modify all the async components between the source and the destination.
Asynchronous, partially blocking
OLE DB Source 1 -->
Merge Join -> OLE DB Destination
OLE DB Source 2 -->
Merge Join is an asynchronous, partially blocking component. You can usually tell the partially blocking components because they require sorted input. For an aggregate, we have to have all the data before we can say what the maximum value is. With a merge join, since we know that both streams are sorted on the join key, we can release rows once the current key runs out of matches. Assume the Merge Join is in an INNER JOIN configuration. If I have rows with values A, B, C from the db1 feed and A, C from db2: while A matches A, we release rows to the destination. We exhaust the As and move to the next key. Source 1 provides B, Source 2 provides C. They don't match, so B is discarded and the next Source 1 row is retrieved. C matches, so it continues on.
It Depends(again)
OLE DB Source -> Script Component -> OLE DB Destination
OLE DB Source ->
Script Component -> OLE DB Destination
A script component operates the way you define it. The default is synchronous but you can make it async.
Jorg has a good table flagging the components into the various buckets: https://jorgklein.com/2008/02/28/ssis-non-blocking-semi-blocking-and-fully-blocking-components/comment-page-2/
The comment asks "What about a lookup transform?"
As the referenced article shows, Lookup is in the Synchronous column. However, if one is looking for performance bottlenecks (async components are usually the first place I look), it's worth pointing out that the default Lookup will cache all the data in the reference table during the PreExecute phase. If your reference table has 10, 100, or 1,000,000 rows, who cares. However long it takes to run SELECT * FROM MyTable and stream it from the database source to the machine SSIS is running on is the performance penalty you pay.
However, if you work at a mutual fund company and have a trade table that records prices for all of your stocks for all time, maybe don't try to pull all of that data back for a lookup transform, hypothetically speaking of course. If you only need the most recent settlement price, don't be lazy and pull more data than you'll ever need and/or crash the machine.

How to achieve database locking based protection against duplicates

In a web application, using the InnoDB storage engine, I was unable to adequately utilise database locking in the following scenario.
There are 3 tables, I will call them aa, ar and ai.
aa holds the base records, let's say articles. ar holds information related to each aa record and the relation between aa and ar is 1:m.
Records in ar are stored when a record from aa is read for the first time. The problem is that when two requests are initiated at (nearly) the same time to read a record from aa (one which does not yet have its related records stored in ar), the ar records are duplicated.
Here is a pseudo code to help understand the situation:
Read the requested aa record.
Scan the ar table to find out if the given aa record has anything stored already. (Assume it has not.)
Consult ai in order to find out what is to be stored in ar for the given aa record. (ai seems somewhat irrelevant, but I found that it too has to be involved in the locking… may be wrong.)
Insert a few rows to ar
Here is what I want to achieve:
Read the requested aa record.
WITH OR WITHOUT USING A TRANSACTION, LOCK ar, SO ANY SUBSEQUENT REQUEST ATTEMPTING TO READ FROM ar WILL WAIT AT THIS POINT UNTIL THIS ONE FINISHES.
Scan the ar table to find out whether the given aa record has anything stored already. (Assume it has not.) The problem is that in the case of two simultaneous requests, both find there are no records in ar for the given aa record and both proceed to insert the same rows, duplicating them. Otherwise, if there are records, this sequence is interrupted and no INSERT occurs.
Consult ai in order to find out what is to be stored in ar for the given aa record. (ai seems somewhat irrelevant, but I found that it too has to be involved in the locking… may be wrong.)
Insert a few rows to ar
RELEASE THE LOCK ON ar
It seems simple enough, yet I was unsuccessful in avoiding the duplicates. I'm testing the simultaneous requests with a simple command in a Bash shell (using wget).
I have spent a while learning how exactly locking works with the InnoDB engine here http://dev.mysql.com/doc/refman/5.5/en/innodb-lock-modes.html and here http://dev.mysql.com/doc/refman/5.5/en/innodb-locking-reads.html and tried several ways to utilise the lock(s), still no luck.
I want the entire ar table locked (since I want to prevent INSERTs to it from multiple requests), causing further attempts to interact with this table to wait until the first lock is released. But there's only one mention of the "entire table" being locked in the documentation (the Intention Locks section of the first linked page), and that's not discussed further, or I was unable to figure out how to achieve it.
Could anyone point me in the right direction?
SET tx_isolation='READ-COMMITTED';
START TRANSACTION;
SELECT * FROM aa WHERE id = 1234 FOR UPDATE;
This ensures that only one thread gets access to a given row in aa at a time. No need to lock the ar table at all, because any other thread who may want to access row 1234 will wait.
Then query ar to find out what rows exist for the corresponding aa, and decide if you want to insert more rows to ar.
Remember that the row in aa is still locked. So be a good citizen by finishing your work quickly, and COMMIT promptly.
COMMIT;
This allows the next thread who has been waiting for the same row of aa to proceed. By using READ-COMMITTED, it will be able to see the just-committed new rows in ar.

Is there a way to query the ordering of two cuda events?

Suppose we record two CUDA events A and B by calling cudaEventRecord. Before we do any synchronization, is there any way to judge whether A will necessarily happen before or after B? For example, if I have this code:
kernelA<<<1,1>>>(...);
cudaEventRecord(A, 0);
kernelB<<<1,1>>>(...);
cudaEventRecord(B, 0);
Then B should happen after A for sure, but how would I know this given only the two handles? Put another way, how would I write a function like this:
bool judge_order(cudaEvent_t A, cudaEvent_t B) {...}
Such that it returns true if A would happen before B.
The question arises because I want to write a memory manager that can effectively reuse memory already used by previous kernel launches.
Everything in CUDA is scheduled on streams. This includes kernel execution, memory transfers, and events. By default, everything operates on stream 0.
Each stream is processed strictly linearly. I.e., in your example it is guaranteed that kernelA has completed before eventA is processed. By querying the status of an event you can tell whether it has been processed, without waiting for it.
Separate streams, however, can be processed in any order. If you used a separate stream for each of your kernels/events, no particular processing order would be guaranteed.
All of this is much better explained in the CUDA programming guide.
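For what it's worth, here is a small host-side sketch (my own, not from the answer or the programming guide) of a non-blocking completion check using cudaEventQuery. It can only tell you whether an event has already been reached on its stream; it cannot predict, from the handles alone, the future ordering of events recorded on different streams. Within a single stream, though, events complete in the order they were recorded, so on the default stream "B completed" implies "A completed" for the code in the question.

#include <cuda_runtime.h>
#include <cstdio>

// Returns true if the event has already been processed on its stream,
// false if it is still pending. Does not block.
bool event_completed(cudaEvent_t e) {
    cudaError_t s = cudaEventQuery(e);
    if (s == cudaSuccess)       return true;
    if (s == cudaErrorNotReady) return false;
    std::fprintf(stderr, "cudaEventQuery failed: %s\n", cudaGetErrorString(s));
    return false;
}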

What's the difference between safe, regular and atomic registers?

Can anyone provide an exhaustive explanation, please? I'm diving into concurrent programming and came across these registers while trying to understand consensus.
From Lamport's "On interprocess communication": ...a regular register is atomic if two successive reads that overlap the same write cannot obtain the new then the old value....
Assume that first comes thread0.write(0), with no overlapping. Basically, using Lamport's definition, one can say that thread1 can read first 1 and then 0 again, if both reads are consecutive and overlap with thread0.write(1). But how is that possible?
Reads and writes to a shared memory location take a finite period of time, so they may either overlap, or be completely distinct.
e.g.
Thread 1:   wwwww    wwwww
Thread 2:     rrrrr        rrrrr
Thread 3: rrrrr rrrrr
The first read from thread 2 overlaps with the first write from thread 1, whilst the second read and second write do not overlap. In thread 3, both reads overlap the first write.
A safe register only guarantees anything about reads that do not overlap writes. If a read does not overlap any write, then it must return the value written by the most recent write. Otherwise it may return any value that the register is capable of holding. So, in thread 2, the second read must return the value written by the second write, but the first read can return any valid value.
A regular register adds the additional guarantee that if a read overlaps with a write then it will either read the old value or the new one, but multiple reads that overlap the write do not have to agree on which, and the value may appear to "flicker" back and forth. This means that two reads from the same thread (such as in thread 3 above) that both overlap the write may appear "out of order": the earlier read returning the new value, and the later returning the old value.
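A toy, single-threaded model may help make the "flicker" concrete. This is only my own illustration of the outcomes a regular register is allowed to produce, not a real concurrent register:

#include <cstdio>

struct RegularRegister {
    int  committed = 0;        // value of the last completed write
    int  pending = 0;          // value of the write currently in flight
    bool write_in_flight = false;
    bool flip = false;         // used to demonstrate the flicker deterministically

    void begin_write(int v) { pending = v; write_in_flight = true; }
    void end_write()        { committed = pending; write_in_flight = false; }

    int read() {
        if (!write_in_flight) return committed;   // non-overlapping read: must see the last write
        flip = !flip;                              // overlapping read: old OR new, arbitrarily
        return flip ? pending : committed;
    }
};

int main() {
    RegularRegister r;                 // initially holds 0
    r.begin_write(1);                  // thread0.write(1) is in progress
    std::printf("%d\n", r.read());     // 1: the new value ...
    std::printf("%d\n", r.read());     // 0: ... then the old value again -- the "flicker"
    r.end_write();
    std::printf("%d\n", r.read());     // 1: after the write completes, only the new value
}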
An atomic register guarantees that the reads and writes appear to happen at a single point in time. Readers that act before that point will all read the old value, and readers that act after it will all read the new value. In particular, if two reads from the same thread overlap a write, then the later read cannot return the old value if the earlier read returned the new one. Atomic registers are linearizable.
The Art of Multiprocessor Programming by Maurice Herlihy and Nir Shavit gives a good description, along with examples and use cases.

cache behaviour on redundant writes

Edit - I guess the question I asked was too long so I'm making it very specific.
Question: if a memory location is in the L1 cache and not marked dirty, and it currently holds a value X, what happens if you try to write X to that same location? Is there any CPU that would see that such a write is redundant and skip it?
For example, is there an optimization which compares the two values and discards a redundant write-back to main memory? Specifically, how do mainstream processors handle this? What about when the value is a special value like 0? If there's no such optimization even for a special value like 0, is there a reason?
Motivation: we have a buffer that can easily fit in the cache. Multiple threads could potentially use it by recycling it amongst themselves. Each use involves writing to n locations (not necessarily contiguous) in the buffer. Recycling simply means setting all values to 0. Each time we recycle, size - n locations are already 0. To me it seems (intuitively) that avoiding so many redundant write-backs would make the recycling process faster, hence the question.
Doing this check in code wouldn't make sense, since the branch instruction itself might cause an unnecessary cache miss (if (buf[i]) {...}).
I am not aware of any processor that does the optimization you describe - eliminating writes to clean cache lines that would not change the value - but it's a good question, a good idea, great minds think alike and all that.
I wrote a great big reply, and then I remembered: this is called "silent stores" in the literature. See "Silent Stores for Free", K. Lepak and M. Lipasti, UWisc, MICRO-33, 2000.
Anyway, in my reply I described some of the implementation issues.
By the way, topics like this are often discussed in the Usenet newsgroup comp.arch.
I also write about them on my wiki, http://comp-arch.net
Your suggested hardware optimization would not reduce the latency. Consider the operations at the lowest level:
The old value at the location is loaded from the cache to the CPU (assuming it is already in the cache).
The old and new values are compared.
If the old and new values are different, the new value is written to the cache. Otherwise it is ignored.
Step 1 may actually take longer than steps 2 and 3, because steps 2 and 3 cannot start until the old value from step 1 has been brought into the CPU. The situation would be the same if it were implemented in software.
Now consider simply writing the new value to the cache without checking the old value. That is actually faster than the three-step process above, for two reasons. Firstly, there is no need to wait for the old value. Secondly, the CPU can simply schedule the write operation in a store buffer, which can perform the cache write while the ALU starts working on something else.
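As a hypothetical software analogue of the two alternatives (my own sketch, just to illustrate the dependency), note that the checking version still has to load the old value before it can do anything else, and adds a branch on top:

#include <cstdint>

void write_unconditionally(std::uint8_t* p, std::uint8_t v) {
    *p = v;                 // just schedule the store; no dependency on the old value
}

void write_if_different(std::uint8_t* p, std::uint8_t v) {
    if (*p != v) {          // steps 1 and 2: load the old value and compare
        *p = v;             // step 3: store only when the value actually changes
    }
}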
So far, the only latencies involved are those between the CPU and the cache, not between the cache and main memory.
The situation is more complicated in modern microprocessors, because their caches are organized into cache lines. When a byte value is written to a cache line, the complete cache line has to be loaded, because the part of the cache line that is not being rewritten has to keep its old values.
http://blogs.amd.com/developer/tag/sse4a/
Read
Cache hit: Data is read from the cache line to the target register
Cache miss: Data is moved from memory to the cache, and read into the target register
Write
Cache hit: Data is moved from the register to the cache line
Cache miss: The cache line is fetched into the cache, and the data from the register is moved to the cache line
This is not an answer to your original question on computer-architecture, but might be relevant to your goal.
In this discussion, all array indices start at zero.
Assuming n is much smaller than size, change your algorithm so that it keeps two pieces of information:
An array of size elements (the full-size array)
An array of n elements plus a counter, used to emulate a set container (duplicate values are allowed)
Every time a non-zero value is written to index k in the full-size array, insert the value k into the set container.
When the full-size array needs to be cleared, take each value stored in the set container (which will contain k, among others) and set the corresponding index in the full-size array to zero. A sketch of this idea follows below.
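A rough C++ sketch of this idea (my own illustration with hypothetical names; the answer above describes the algorithm only in prose):

#include <cstddef>
#include <vector>

struct SparseClearBuffer {
    std::vector<int>         data;     // the full-size array, all zeros initially
    std::vector<std::size_t> touched;  // emulated set container; duplicates allowed

    explicit SparseClearBuffer(std::size_t size) : data(size, 0) {}

    void write(std::size_t k, int value) {
        if (value != 0) touched.push_back(k);       // remember where we wrote
        data[k] = value;
    }

    void recycle() {
        for (std::size_t k : touched) data[k] = 0;  // clear only the touched slots
        touched.clear();
    }
};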
A similar technique, known as a two-level histogram or radix histogram, can also be used.
Two pieces of information are stored:
An array of size elements (the full-size array)
A boolean array of ceil(size / M) elements, where M is the radix and ceil is the ceiling function
Every time a non-zero value is written to index k in the full-size array, the element floor(k / M) in the boolean array should be marked.
Let's say, bool_array[j] is marked. This corresponds to the range from j*M to (j+1)*M-1 in the full-size array.
When the full-size array needs to be cleared, scan the boolean array for marked elements, and clear the corresponding ranges in the full-size array. A sketch follows below.
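Again, a rough C++ sketch of the two-level/radix variant (my own illustration with hypothetical names):

#include <algorithm>
#include <cstddef>
#include <vector>

struct BlockClearBuffer {
    static constexpr std::size_t M = 64;   // the radix: elements per block (arbitrary choice)
    std::vector<int>  data;                // the full-size array
    std::vector<bool> block_dirty;         // one flag per block of M elements

    explicit BlockClearBuffer(std::size_t size)
        : data(size, 0), block_dirty((size + M - 1) / M, false) {}

    void write(std::size_t k, int value) {
        if (value != 0) block_dirty[k / M] = true;  // mark the block containing index k
        data[k] = value;
    }

    void recycle() {
        for (std::size_t j = 0; j < block_dirty.size(); ++j) {
            if (!block_dirty[j]) continue;
            std::size_t lo = j * M;
            std::size_t hi = std::min(lo + M, data.size());
            for (std::size_t k = lo; k < hi; ++k) data[k] = 0;  // clear just this block
            block_dirty[j] = false;
        }
    }
};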