Concurrency and memory models - language-agnostic

I'm watching this video by Herb Sutter on GPGPU and the new C++ AMP library. He talks about memory models and mentions Weak Memory Models and Strong Memory Models. I think he's referring to read/write ordering and the like, but I'm not sure.
Google turns up some interesting results (mostly science papers) on memory models, but can someone explain what a Weak Memory Model and a Strong Memory Model are, and how they relate to concurrency?

In terms of concurrency, a memory model specifies the constraints on data accesses, and the conditions under which data written by one thread/core/processor becomes visible to another.
The terms weak and strong are somewhat ambiguous, but the basic premise is that a strong memory model places a lot of constraints on the hardware to ensure that writes by one thread/core/processor are visible to other threads/cores/processors in clearly-defined orders, whilst allowing the programmer maximum freedom of data access.
On the other hand, a weak model places very few constraints on the hardware, and instead places the responsibility for ensuring visibility in the hands of the programmer.
The strongest memory model is Sequential Consistency: all operations to all data by all processors form a single total order agreed on by all processors, which is consistent with the order of operations on each processor individually. This is essentially an interleaving of the operations of each processor.
The weakest memory model imposes no restrictions on the order in which processors see each other's writes. Different processors in the same system may see writes in different orders, and some processors may use "stale" data from their own cache for a long time after another processor has written to the same memory address. Sometimes whole cache lines are treated as a single unit: a processor that writes to one variable on a cache line may effectively discard other processors' writes to other variables on that line which it has not yet seen, because the stale values are written over the top when it eventually writes the cache line back to memory. Under such a scheme, extreme care must be taken to ensure that data is transferred to other processors in the correct order, using explicit synchronization instructions.
For example, the Intel x86 memory model is generally considered to be on the stronger end, as there are strict rules about the order in which writes become visible to other processors, whereas the DEC Alpha and ARM processors are generally considered to have weak memory models, as writes from one processor are only required to be visible to other processors in a particular order if you explicitly put ordering instructions (memory fences or barriers) in your code.
Some systems have memory that is only accessible by particular processors. Transferring data between these processors therefore requires explicit data transfer instructions. This is the case with the Cell processors, and is often the case with GPUs as well. This can be viewed as an extreme of a weak memory model --- data is only visible to other processors if you explicitly invoke the data transfer.
Programming languages usually impose their own memory models on top of whatever is provided by the underlying processors. For example, C++0x (now C++11) specifies a complete set of ordering constraints ranging from completely relaxed to full sequential consistency, so you can specify in code what you require. On the other hand, Java has a very specific set of ordering constraints that must be adhered to and cannot be varied. In both cases the compiler must translate the desired constraints into the relevant instructions for the underlying processor, which may be quite involved if you request sequential consistency on a weakly ordered machine.
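As a rough illustration of what those C++ ordering constraints look like in code (a minimal sketch of my own, not from the answer; the names data and ready are just illustrative):

    #include <atomic>
    #include <cassert>
    #include <thread>

    std::atomic<bool> ready{false};
    int data = 0;

    void producer() {
        data = 42;                                     // plain write
        ready.store(true, std::memory_order_release);  // publish: writes above may not move below this store
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) // acquire pairs with the release above
            ;                                          // spin until published
        assert(data == 42);                            // guaranteed to see the write to data
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
    }

The default memory_order_seq_cst requests full sequential consistency, memory_order_relaxed provides atomicity with no ordering guarantees at all, and acquire/release (shown here) sits in between; on a weakly ordered machine the compiler emits the necessary fence instructions for whichever level you ask for.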

The two terms aren't clearly defined, and it's not a black/white thing.
Memory models can be extremely weak, extremely strong, or anywhere in between.
It basically refers to the guarantees offered about concurrent memory accesses.
Naively, you would expect a write made on one thread to be immediately visible to all other threads, and you would expect events to appear in the same order on all threads as well.
But in a weaker memory model, neither of those may hold.
Sequential consistency is the term for a memory model which guarantees that events are seen in the same order across all threads. So a memory model which ensures sequential consistency is pretty strong.
A weaker guarantee is causal consistency: the guarantee that events are observed after the events they depend on.
In other words, if you first write a value x to some address A, and then write a second value y to the same address, no thread will ever read the old value x after having read the newer value y. Because the two writes are to the same address, it would violate causal consistency if threads could observe them in different orders.
But this says nothing about what should happen to unrelated events. The result of writing a third value to a different memory address could be observed at absolutely any time by other threads (so different threads may observe events in a different order, unlike under sequential consistency).
There are plenty other such levels of "consistency", some stronger, some weaker, and offering all sorts of subtle guarantees about what you can rely on.
Fundamentally, a stronger memory model is going to offer more guarantees about the order in which events are observed, and will normally guarantee behavior closer to what you'd intuitively expect.
But a weaker model allows more room for optimization, and in particular it scales better with more cores (because less synchronization is required).
Sequential consistency is basically free on a single-core CPU, doable on a quad-core, but would be prohibitively expensive on a 32-core system, on a system with 4 physical CPUs, or on a shared-memory system spanning multiple physical machines.
The more cores you have, and the further apart they are, the harder it is to ensure that they all observe events in the same order. So compromises are made, and you settle for a weaker memory model which makes looser guarantees.
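To make the difference concrete, here is the classic "store buffering" litmus test written with C++ relaxed atomics (a sketch I've added for illustration; it is not from the answer above). Under sequential consistency at least one thread must see the other's write, but with relaxed ordering, or on a weakly ordered machine without fences, both r1 and r2 can end up as 0:

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;

    void thread1() {
        x.store(1, std::memory_order_relaxed);   // write x
        r1 = y.load(std::memory_order_relaxed);  // then read y
    }

    void thread2() {
        y.store(1, std::memory_order_relaxed);   // write y
        r2 = x.load(std::memory_order_relaxed);  // then read x
    }

    int main() {
        std::thread a(thread1), b(thread2);
        a.join();
        b.join();
        // Under sequential consistency r1 == 0 && r2 == 0 is impossible:
        // one of the stores must come first in the single total order.
        // With relaxed ordering (or weak hardware without fences) it is allowed.
        std::printf("r1=%d r2=%d\n", r1, r2);
    }

Replacing memory_order_relaxed with memory_order_seq_cst (the default for std::atomic operations) rules the 0/0 outcome out, at the cost of extra fences on weakly ordered hardware.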

Yes, you are right - the difference between Weak and Strong memory models is a difference in which optimizations are available (ordering of reads/writes and the related fences).
You can specify a memory model by starting with a sequentially consistent model (the most restrictive, or strongest, model), and then specifying how reads and writes from a single thread can be introduced, removed, or moved with respect to one another.
In this model (sequentially consistent) the memory is independent of any of the processors (threads) that use it. The memory is connected to each of the threads by a controller that feeds read and write requests from each thread. The reads and writes from a single thread reach memory in exactly the order specified by the thread, but they might be interleaved with reads and writes from other threads in an unspecified way.
See "Understand the Impact of Low-Lock Techniques in Multithreaded Apps".
However, there is no exact boundary between strong and weak memory models, unless you compare the sequentially consistent model against everything else. Some models are simply stronger or weaker than others, and therefore more or less open to optimization by reordering. For example, the memory model in .NET 2.0 for x86 allows a few more optimizations than the version in .NET 1.1, so it can be considered a weaker model.

Google turns up some interesting results (mostly science papers) on memory models, but can someone explain what a Weak Memory Model and a Strong Memory Model are, and how they relate to concurrency?
A strong memory model is one where, from the point of view of other cores, reads and writes appear to happen as they appear in the program and, in particular, in the order in which they appear in the program. This is known as sequential consistency.
A weak memory model is one where memory operations may be reordered by the CPU. All practical CPU architectures allow instructions to be reordered to some extent.
Note that Herb Sutter uses "strong memory model" to mean one where atomic intrinsics are not reordered. This is not the commonly accepted definition.

Related

Most generally correct way of updating a vertex buffer in Vulkan

Assume a vertex buffer in device memory and a staging buffer that's host coherent and visible. Also assume a desktop system with a discrete GPU (so separate memories). And lastly, assume correct inter-frame synchronization.
I see two general possible ways of updating a vertex buffer:
1. Map + memcpy + unmap into the staging buffer, followed by a transient (single command) command buffer that contains a vkCmdCopyBuffer; submit it to the graphics queue and wait for the queue to idle, then free the transient command buffer. After that, submit the regular frame draw queue to the graphics queue as usual. This is the code used on https://vulkan-tutorial.com (for example, this .cpp file).
2. Similar to the above, only instead use an additional semaphore that is signalled after the staging buffer copy submit and waited on in the regular frame draw submit, thus skipping the "wait-for-idle" call.
#2 sort of makes sense to me, and I've repeatedly read not to do any "wait-for-idle" operations in Vulkan because it synchronizes the CPU with the GPU, but I've never seen it used in any tutorial or example online. What do the pros usually do if the vertex buffer has to be updated relatively often?
First, if you allocated coherent memory, then you almost certainly did so in order to access it from the CPU. Which requires mapping it. Vulkan is not OpenGL; there is no requirement that memory be unmapped before it can be used (and OpenGL doesn't even have that requirement anymore).
Unmapping memory should only ever be done when you are about to delete the memory allocation itself.
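For example, a minimal sketch of keeping the staging buffer persistently mapped (my own illustration; stagingMemory, stagingSize and the vertex data are placeholders, and error handling is omitted):

    #include <vulkan/vulkan.h>
    #include <cstring>
    #include <vector>

    // Map once, right after allocating the host-visible, host-coherent staging
    // memory, and keep the pointer for the lifetime of the allocation.
    void* mapStagingOnce(VkDevice device, VkDeviceMemory stagingMemory, VkDeviceSize stagingSize) {
        void* mapped = nullptr;
        vkMapMemory(device, stagingMemory, 0, stagingSize, 0, &mapped);
        return mapped;
    }

    // Every frame (or whenever the vertices change), just copy into the mapping.
    // Because the memory is HOST_COHERENT, no vkFlushMappedMemoryRanges is needed.
    void updateStaging(void* mapped, const std::vector<float>& vertexData) {
        std::memcpy(mapped, vertexData.data(), vertexData.size() * sizeof(float));
    }

    // vkUnmapMemory(device, stagingMemory) only when you are about to free the allocation.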
Second, if you think of an idea that involves having the CPU wait for a queue or device to idle before proceeding, then you have come up with a bad idea and should use a different one. The only time you should wait for a device to idle is when you want to destroy the device.
Tutorial code should not be trusted to give best practices. It is often intended to be simple, to make it easy to understand a concept. Simple Vulkan code often gets in the way of performance (and if you don't care about performance, you shouldn't be using Vulkan).
In any case, there is no "most generally correct way" to do most things in Vulkan. There are lots of definitely incorrect ways, but no "generally do this" advice. Vulkan is a low-level, explicit API, and the result of that is that you need to apply Vulkan's tools to your specific circumstances. And maybe profile on different hardware.
For example, if you're generating completely new vertex data every frame, it may be better to see if the implementation can read vertex data directly from coherent memory, so that there's no need for a staging buffer at all. Yes, the reads may be slower, but the overall process may be faster than a transfer followed by a read.
Then again, it may not. It may be faster on some hardware, and slower on others. And some hardware may not allow you to use coherent memory for any buffer that has the vertex input usage at all. And even if it's allowed, you may be able to do other work during the transfer, and thus the GPU spends minimal time waiting before reading the transferred data. And some hardware has a small pool of device-local memory which you can directly write to from the CPU; this memory is meant for these kinds of streaming applications.
If you are going to do staging however, then your choices are primarily about which queue you submit the transfer operation on (assuming the hardware has multiple queues). And this primarily relates to how much latency you're willing to endure.
For example, if you're streaming data for a large terrain system, then it's probably OK if it takes a frame or two for the vertex data to be usable on the GPU. In that case, you should look for an alternative, transfer-only queue on which to perform the copy from the staging buffer to the primary memory. If you do, then you'll need to make sure that later commands which use the eventual results synchronize with that queue, which will need to be done via a semaphore.
If you're in a low-latency scenario where the data being transferred needs to be used this frame, then it may be better to submit both to the same queue. You could use an event to synchronize them rather than a semaphore. But you should also endeavor to put some kind of unrelated work between the transfer and the rendering operation, so that you can take advantage of some degree of parallelism in operations.
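As a rough sketch of the transfer-queue variant described above (structure only, under my own naming; queue family ownership transfer barriers, fences for reclaiming the transfer command buffer, and error handling are all omitted, and every handle is assumed to have been created elsewhere):

    #include <vulkan/vulkan.h>

    void submitVertexUpload(VkQueue transferQueue, VkQueue graphicsQueue,
                            VkCommandBuffer transferCmdBuf, VkCommandBuffer frameCmdBuf,
                            VkBuffer stagingBuffer, VkBuffer vertexBuffer,
                            VkDeviceSize vertexDataSize,
                            VkSemaphore transferDone, VkFence inFlightFence) {
        // Record the copy on a command buffer from the transfer queue family.
        VkCommandBufferBeginInfo begin{};
        begin.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
        begin.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;
        vkBeginCommandBuffer(transferCmdBuf, &begin);

        VkBufferCopy region{0, 0, vertexDataSize};
        vkCmdCopyBuffer(transferCmdBuf, stagingBuffer, vertexBuffer, 1, &region);
        vkEndCommandBuffer(transferCmdBuf);

        // Submit on the transfer queue, signalling a semaphore when the copy completes.
        VkSubmitInfo copySubmit{};
        copySubmit.sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO;
        copySubmit.commandBufferCount   = 1;
        copySubmit.pCommandBuffers      = &transferCmdBuf;
        copySubmit.signalSemaphoreCount = 1;
        copySubmit.pSignalSemaphores    = &transferDone;
        vkQueueSubmit(transferQueue, 1, &copySubmit, VK_NULL_HANDLE);

        // The frame's graphics submit waits on that semaphore at the vertex-input
        // stage, instead of the CPU waiting for the queue to go idle.
        VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_VERTEX_INPUT_BIT;
        VkSubmitInfo drawSubmit{};
        drawSubmit.sType              = VK_STRUCTURE_TYPE_SUBMIT_INFO;
        drawSubmit.waitSemaphoreCount = 1;
        drawSubmit.pWaitSemaphores    = &transferDone;
        drawSubmit.pWaitDstStageMask  = &waitStage;
        drawSubmit.commandBufferCount = 1;
        drawSubmit.pCommandBuffers    = &frameCmdBuf;
        vkQueueSubmit(graphicsQueue, 1, &drawSubmit, inFlightFence);
    }

The GPU stalls only at the vertex-input stage of the frame that actually consumes the data, so earlier stages (and the CPU) keep working while the copy is in flight.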

MIPS Architecture : NOP (No-Operation) Vs Data Forwarding in Hazard Prevention

I learnt in my computer architecture course that data hazards can be prevented by inserting independent nop instructions between two mutually dependent instructions. This can be done at the assembly level, by the compiler.
The alternative way to avoid data hazards is to use data forwarding.
I am a bit confused about how these two alternatives differ as far as performance, speed and hardware are concerned, because as far as I know data forwarding is implemented at the hardware level, whereas nops can be inserted at the assembly level.
Can anybody please explain which approach is better if we consider factors such as performance, speed and hardware?
Thanks.
Obviously, having the compiler insert nops into the code stream to fill pipeline slots allows hardware to be simplified which can reduce the duration of a pipeline stage or the depth of the pipeline, reduce design effort (time to market, project risk, design cost), or allow a full processor core to fit on a single chip (which helps performance). However, this benefit is tiny compared to the loss of performance from not using forwarding. Higher latency for dependent instructions is very bad for typical programs.
The MIPS R2000, which had both delayed branches and delayed loads, provided result forwarding. (MIPS is an acronym for "Microprocessor without Interlocked Pipeline Stages"). Delayed loads were soon removed from MIPS (which was possible because such did not affect binary compatibility of correct code). The use of delayed instructions was partially from a belief that most delay slots could be filled by the compiler with useful instructions and partially from believing that the increase in code size was not important relative to the simplification of hardware.
Reducing the latency of a load operation was not practical, so the pipeline would need to be stalled for a cycle anyway. The cost of a nop is in cache and memory capacity effects (i.e., the effect of lower code density), and in some cases a single load delay slot could be filled.
Exposing the pipeline organization also has implications for binary compatibility. Later binary compatible implementations must accommodate the ISA designed for the original pipeline organization. A single delayed branch slot works reasonably well for a simple 5-stage scalar implementation (it can be filled with a useful instruction most of the time and allows zero-effective-delay branches [i.e., no stall to resolve the branch or prediction and flushing the pipeline on misprediction]), but when the pipeline is deepened (or made wider) prediction or stalling becomes necessary anyway.
If sufficient parallelism exists in the targeted workloads, hardware simplicity is sufficiently important, and binary compatibility is not a problem, then exposing a pipeline with minimal support for dynamically detecting and handling stall conditions may be sensible. (There are also ways of encoding nops that avoid most of the code size expansion issues.) Having reliably sufficient parallelism (whether instruction-level or thread-level) allows nops to be avoided: by compiler scheduling with instruction-level parallelism, or by hardware thread interleaving with thread-level parallelism.
Hardware simplicity tends to reduce energy per unit of work (as well as chip area), and many modern designs are limited by power use. It also makes sense to perform optimizations at compile time (when they are less latency critical and can be done once rather than each time the code is executed) if the storage and communication cost of additional information is not too expensive (assuming information necessary to perform the optimization is available at compile time [dynamic branch prediction is a classic example of where dynamic information is helpful]).
Well, you might think that because the hardware is optimised with forwarding, there should be no need for explicitly inserted software NOPs. But that's not the case.
Although forwarding helps reduce data hazards, some hazards simply cannot be resolved by forwarding.
For example:

    beq R1, R5, label
    <next instruction>
Here the next instruction should not proceed until the beq has completed its execute stage and decided whether or not to branch. Until then it has to be stalled (for two cycles in the classic five-stage pipeline). In a pipeline without hardware interlocks, this is done in software by inserting NOPs.
With hardware optimizations, the comparison for beq can be moved earlier, into the register fetch/decode stage, by adding a dedicated comparator there. Even so, the following instruction still has to be stalled for one cycle, so a NOP (or a hardware stall) is still needed.

Memory coalescing and transaction

After reading about the topic, I have 2 questions related to Global Memory coalescing access:
1- I read that one requirement for memory coalescing is that the words accessed by the threads must be 4, 8, or 16 bytes, but apparently this is valid only for devices with compute capability less than 1.3. Is that right? On later devices (>= 1.3), can a thread access even one or two bytes and still get a coalesced memory access?
2- Will it matter (time mainly) if a (half-)warp's global memory access generates a 128-byte transaction instead of a 64-byte one because of word misalignment? And what about the extra data transferred, will it be discarded by the system?
Thank you
1) You can access the data any way you want on later devices, but the performance will still be poor if you request a data segment that is narrow, i.e. you will not achieve the full memory bandwidth of your GPU.
2) This again depends on the overall scheme of your code. Generally, the improvement in later versions of CUDA was that non-aligned reads/writes no longer resulted in abysmal performance, but instead resulted in, e.g., two transactions being issued instead of one.
Think of it like putting people on a bus. If you can fill your whole crowd into a single bus with one destination, you get better efficiency than using two buses that are only half filled.
So yes, it will matter, but depending on whether you are memory or compute bound, it will matter differently.
Arranging your read/write patterns to utilize the full bandwidth has given me the last 20-30% of performance in many applications.
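To see why alignment changes the number of transactions, here is a small host-side C++ sketch (my own illustration, not from the answer) that counts how many aligned 128-byte segments a half-warp of 16 threads touches for a given element size and starting offset:

    #include <cstdint>
    #include <cstdio>
    #include <set>

    // Count the aligned 128-byte segments touched by a half-warp (16 threads)
    // reading consecutive elements of elemSize bytes starting at baseOffset.
    int segmentsTouched(std::uint64_t baseOffset, int elemSize) {
        std::set<std::uint64_t> segments;
        for (int tid = 0; tid < 16; ++tid) {
            std::uint64_t addr = baseOffset + static_cast<std::uint64_t>(tid) * elemSize;
            segments.insert(addr / 128);  // which 128-byte segment this access falls in
        }
        return static_cast<int>(segments.size());
    }

    int main() {
        // 16 threads reading 4-byte floats = 64 bytes in total.
        std::printf("aligned:    %d segment(s)\n", segmentsTouched(0, 4));   // 1 segment
        std::printf("misaligned: %d segment(s)\n", segmentsTouched(116, 4)); // straddles a boundary: 2 segments
    }

This matches the answer above: a misaligned half-warp costs an extra (or wider) transaction rather than one transaction per thread, so it hurts, but it is no longer catastrophic.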
/Henrik

What advantages are there to programming for a non-cache-coherent multi-core machine?

What advantages are there to programming for a non-cache-coherent multi-core machine? Cache coherence has many benefits, but how would one take advantage of the opposite of this feature - an independent cache for each individual core. What programming paradigm and to what particular practical problems would such an architecture be beneficial over a cache-coherent one?
You don't as such take advantage of cache non-coherence. You can't write code which relies on different cores having different views of memory, because a non-coherent cache doesn't guarantee to show different memory to different cores. It just reserves the right to do that.
Cache coherence costs circuits and time. Non-coherent caches are therefore cheaper (and cooler, perhaps?) and faster. Memory access might be faster in cycles, or might be the same best-case speed but with fewer stalls due to cache synchronisation and especially false sharing.
So it's not so much extra things you do to take advantage of non-coherence, it's the things that you don't have to do because you've dropped the disadvantages of coherence - you don't have to redesign your parallel code because it's spending all its time sitting around waiting for the result of a memory store from another core.
The downside of a non-coherent cache architecture at first appears to be that you find yourself using additional synchronisation that coherent caches provide automatically. No double-checked locking for you. Then you realise that, in effect, coherent-cache architectures do this synchronisation (albeit in a super-fast hardware-implemented form) for every single memory access, and block if the cache line is dirty, whether you need it to or not. That cheers me right up :-)
What programming paradigm
Message passing.
and to what particular practical problems would such an architecture be beneficial over a cache-coherent one?
Pattern matching - the input block of memory could very well be "read-only": the "output" result can very well be placed in separate blocks waiting for a "reducer" of some sort.
Of course, this is just an example amongst many I am sure.
Just to make things clear: the principal reasons for going with "non-cache-coherent" architecture are cost & speed (assuming the problems at hand are more efficiently tackled using this architecture).
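As a small illustration of the message-passing style mentioned above (a sketch of my own, using ordinary C++ threads rather than any particular non-coherent hardware): each worker only touches its own data and communicates through an explicit queue, so nothing relies on two cores silently observing the same memory.

    #include <condition_variable>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <thread>

    // A minimal message channel: the only shared state, explicitly synchronized.
    template <typename T>
    class Channel {
        std::queue<T> q_;
        std::mutex m_;
        std::condition_variable cv_;
    public:
        void send(T msg) {
            { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(msg)); }
            cv_.notify_one();
        }
        T receive() {
            std::unique_lock<std::mutex> lk(m_);
            cv_.wait(lk, [this] { return !q_.empty(); });
            T msg = std::move(q_.front());
            q_.pop();
            return msg;
        }
    };

    int main() {
        Channel<int> results;

        // The worker reads its own input and sends results out through the channel.
        std::thread worker([&results] {
            for (int i = 0; i < 5; ++i)
                results.send(i * i);  // all communication goes through the channel
        });

        // The "reducer" consumes messages; it never shares mutable state with the worker.
        int sum = 0;
        for (int i = 0; i < 5; ++i)
            sum += results.receive();
        worker.join();
        std::printf("sum = %d\n", sum);
    }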
You can get a bit of extra performance, but you should never rely on each processor having different cache values, as you can never know when the cache is flushed.
I'm not an expert; but I don't think it has any advantage over a cache coherent architecture, besides from being simpler to implement. Of course, such simplicity can allow other optimizations that could be prohibitive in a more complex coherent system, making the non-coherent machine faster when carefully programmed.
That said, I concur with jldupont: message passing doesn't need coherency, so it's (almost) the mandatory way to do IPC.
You could think of the Cell SPE local memory as a sort of cache. It isn't cache really since it isn't automatic at all, but the speed is the same and it isn't coherent.
It has big speed advantages because the hardware does not need to spend any time synchronizing the cache line states between cores.
In a Cell, the programmer must do the synchronization manually by writing code to copy SPE local memory back and forth. So a disadvantage is much greater program complexity.

What are some tricks that a processor does to optimize code?

I am looking for things like reordering of code that could even break the code in the case of a multiple processor.
The most important one would be memory access reordering.
Absent memory fences or serializing instructions, the processor is free to reorder memory accesses. Some processor architectures have restrictions on how much they can reorder; Alpha is known for being the weakest (i.e., the one which can reorder the most).
A very good treatment of the subject can be found in the Linux kernel source documentation, at Documentation/memory-barriers.txt.
Most of the time, it's best to use locking primitives from your compiler or standard library; these are well tested, should have all the necessary memory barriers in place, and are probably quite optimized (optimizing locking primitives is tricky; even the experts can get them wrong sometimes).
Wikipedia has a fairly comprehensive list of optimization techniques here.
Yes, but what exactly is your question?
However, since this is an interesting topic: tricks that compilers and processors use to optimize code should not break code, even with multiple processors, in the absence of race conditions in that code. This is called the guarantee of sequential consistency for data-race-free programs: if your program has no race conditions, and all data is correctly locked before being accessed, the code behaves as if it were executed sequentially.
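For instance, a minimal data-race-free C++ sketch (my own addition): because every access to counter is protected by the same mutex, the compiler and CPU may still reorder and cache aggressively under the hood, but the observable behaviour is as if the increments happened one after another.

    #include <cstdio>
    #include <mutex>
    #include <thread>

    std::mutex m;
    long counter = 0;

    void work() {
        for (int i = 0; i < 100000; ++i) {
            std::lock_guard<std::mutex> lk(m);  // lock before every access: no data race
            ++counter;
        }
    }

    int main() {
        std::thread a(work), b(work);
        a.join();
        b.join();
        std::printf("%ld\n", counter);  // always 200000, as if executed sequentially
    }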
There is a really good video of Herb Sutter talking about this here:
http://video.google.com/videoplay?docid=-4714369049736584770
Everyone should watch this :)
DavidK's answer is correct; however, it is also very important to be aware of the memory model for your language/runtime. Even without race conditions, and with sequential consistency and mutex usage, your code can still break when data is being cached by different threads running on different cores of the CPU. Some languages (Java is one example) guarantee the visibility of data between threads when a mutex lock is used, but it is rarely enough to simply ensure that no two threads can access the data at the same time: you need to use the mutex in the correct way so that the language runtime synchronizes the data state between the two threads. In Java this is done by having the two threads synchronize on the same object.
Here is a good page explaining the problem and how it's dealt with in Java's memory model.
http://gee.cs.oswego.edu/dl/cpj/jmm.html