What's the difference between lockless and lockfree?

In some articles about algorithms, some authors use the word lockfree and others use lockless. What's the difference between lockless and lockfree? Thanks!
Update
http://www.intel.com/content/dam/www/public/us/en/documents/guides/intel-dpdk-programmers-guide.pdf
Section 5.2, "Lockless Ring Buffer in Linux*", is an example of the use of the word "lockless".

An algorithm is lock-free if it guarantees that, when the program's threads are run sufficiently long, at least one of them makes progress (for some sensible definition of progress). All wait-free algorithms are lock-free.
In general, a lock-free algorithm can run in four phases: completing one's own operation, assisting an obstructing operation, aborting an obstructing operation, and waiting. Completing one's own operation is complicated by the possibility of concurrent assistance and abortion, but it is invariably the fastest path to completion (see non-blocking algorithms).
Lockless programming is a set of techniques for safely manipulating shared data without using locks. There are lockless algorithms available for passing messages, sharing lists and queues of data, and other tasks. Lockless programming is pretty complicated. For example, all purely functional data structures are inherently lock-free, since they are immutable.
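As a minimal illustration of "at least one thread makes progress" (my own sketch, not taken from any particular article), here is a compare-and-swap retry loop in C++:
    #include <atomic>

    std::atomic<int> counter{0};

    // Lock-free increment: the CAS loop may have to retry when another thread
    // wins the race, but a retry only happens because *some* thread's CAS
    // succeeded, so the system as a whole always makes progress. If one thread
    // is suspended or dies mid-operation, the others are unaffected.
    void increment() {
        int expected = counter.load(std::memory_order_relaxed);
        while (!counter.compare_exchange_weak(expected, expected + 1,
                                              std::memory_order_relaxed)) {
            // 'expected' has been reloaded with the current value; just retry.
        }
    }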

Lock-free is a more formal thing (look for lock-free algorithms). The essence of it for data structures is that if two threads/processes access the data structure and one of them dies, the other one is still guaranteed to complete the operation.
Lockless is about implementation - it means the algorithm does not use locks (or using the more formal name - mutual exclusion).
Therefore a lock-free algorithm is also lockless (because if one thread took a lock and then died, the others would wait forever), but not the other way around: there are algorithms that don't use locks (e.g. they use compare-and-swap) but can still hang if another process dies. The DPDK ring buffer mentioned in the question is an example of a lockless design that is not lock-free.
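To make the contrast concrete, here is a sketch (mine, not the DPDK code) of something that is lockless in the sense of using no mutex primitive and yet is not lock-free: every thread spins on a compare-and-swap, so if the thread that set the flag dies, all the others hang.
    #include <atomic>

    // A "lockless" spinlock built only from compare-and-swap. No OS mutex is
    // used, but this is NOT lock-free: if the thread that set 'busy' to true
    // dies before clearing it, every other thread spins forever.
    std::atomic<bool> busy{false};

    void enter() {
        bool expected = false;
        while (!busy.compare_exchange_weak(expected, true,
                                           std::memory_order_acquire)) {
            expected = false;  // CAS failed: someone else holds the flag; retry
        }
    }

    void leave() {
        busy.store(false, std::memory_order_release);
    }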

Most generally correct way of updating a vertex buffer in Vulkan

Assume a vertex buffer in device memory and a staging buffer that's host coherent and visible. Also assume a desktop system with a discrete GPU (so separate memories). And lastly, assume correct inter-frame synchronization.
I see two general possible ways of updating a vertex buffer:
1. Map + memcpy + unmap into the staging buffer, followed by a transient (single-command) command buffer that contains a vkCmdCopyBuffer; submit it to the graphics queue and wait for the queue to idle, then free the transient command buffer. After that, submit the regular frame draw commands to the graphics queue as usual. This is the code used on https://vulkan-tutorial.com (for example, this .cpp file).
2. Similar to the above, only instead use additional semaphores to signal after the staging-buffer copy submit, and wait in the regular frame draw submit, thus skipping the "wait-for-idle" call.
#2 sort of makes sense to me, and I've repeatedly read not to do any "wait-for-idle" operations in Vulkan because it synchronizes the CPU with the GPU, but I've never seen it used in any tutorial or example online. What do the pros usually do if the vertex buffer has to be updated relatively often?
First, if you allocated coherent memory, then you almost certainly did so in order to access it from the CPU. Which requires mapping it. Vulkan is not OpenGL; there is no requirement that memory be unmapped before it can be used (and OpenGL doesn't even have that requirement anymore).
Unmapping memory should only ever be done when you are about to delete the memory allocation itself.
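A minimal sketch of that advice, assuming a HOST_VISIBLE | HOST_COHERENT staging allocation (the handles, the Vertex layout, and the function names are placeholders of mine, not code from any tutorial):
    #include <cstring>
    #include <vector>
    #include <vulkan/vulkan.h>

    struct Vertex { float pos[3]; float color[3]; };  // placeholder layout

    // Map the staging allocation once and keep the pointer for the lifetime of
    // the allocation. The memory is HOST_COHERENT, so no
    // vkFlushMappedMemoryRanges is needed after writing.
    void* MapStagingOnce(VkDevice device, VkDeviceMemory stagingMemory) {
        void* mapped = nullptr;
        vkMapMemory(device, stagingMemory, 0, VK_WHOLE_SIZE, 0, &mapped);
        return mapped;
    }

    // Per-frame update: just write through the persistent pointer.
    void UploadVertices(void* mapped, const std::vector<Vertex>& vertices) {
        std::memcpy(mapped, vertices.data(), vertices.size() * sizeof(Vertex));
    }

    // Only when the allocation itself is being destroyed:
    void DestroyStaging(VkDevice device, VkDeviceMemory stagingMemory) {
        vkUnmapMemory(device, stagingMemory);
        vkFreeMemory(device, stagingMemory, nullptr);
    }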
Second, if you think of an idea that involves having the CPU wait for a queue or device to idle before proceeding, then you have come up with a bad idea and should use a different one. The only time you should wait for a device to idle is when you want to destroy the device.
Tutorial code should not be trusted to give best practices. It is often intended to be simple, to make it easy to understand a concept. Simple Vulkan code often gets in the way of performance (and if you don't care about performance, you shouldn't be using Vulkan).
In any case, there is no "most generally correct way" to do most things in Vulkan. There are lots of definitely incorrect ways, but no "generally do this" advice. Vulkan is a low-level, explicit API, and the result of that is that you need to apply Vulkan's tools to your specific circumstances. And maybe profile on different hardware.
For example, if you're generating completely new vertex data every frame, it may be better to see if the implementation can read vertex data directly from coherent memory, so that there's no need for a staging buffer at all. Yes, the reads may be slower, but the overall process may be faster than a transfer followed by a read.
Then again, it may not. It may be faster on some hardware, and slower on others. And some hardware may not allow you to use coherent memory for any buffer that has the vertex input usage at all. And even if it's allowed, you may be able to do other work during the transfer, and thus the GPU spends minimal time waiting before reading the transferred data. And some hardware has a small pool of device-local memory which you can directly write to from the CPU; this memory is meant for these kinds of streaming applications.
If you are going to do staging however, then your choices are primarily about which queue you submit the transfer operation on (assuming the hardware has multiple queues). And this primarily relates to how much latency you're willing to endure.
For example, if you're streaming data for a large terrain system, then it's probably OK if it takes a frame or two for the vertex data to be usable on the GPU. In that case, you should look for an alternative, transfer-only queue on which to perform the copy from the staging buffer to the primary memory. If you do, then you'll need to make sure that later commands which use the eventual results synchronize with that queue, which will need to be done via a semaphore.
If you're in a low-latency scenario where the data being transferred needs to be used this frame, then it may be better to submit both to the same queue. You could use an event to synchronize them rather than a semaphore. But you should also endeavor to put some kind of unrelated work between the transfer and the rendering operation, so that you can take advantage of some degree of parallelism in operations.
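For the transfer-queue path described two paragraphs up, the two submissions might be wired together roughly as in the sketch below (command buffers, queues, the semaphore, and the fence are assumed to already exist; queue-family ownership transfer barriers are omitted for brevity):
    #include <vulkan/vulkan.h>

    // Submit the staging->device copy on the transfer queue, signalling a
    // semaphore, then make the graphics submission wait on that semaphore
    // before the vertex-input stage reads the freshly copied buffer.
    void SubmitTransferThenDraw(VkQueue transferQueue, VkQueue graphicsQueue,
                                VkCommandBuffer transferCmd, VkCommandBuffer drawCmd,
                                VkSemaphore transferDone, VkFence frameFence) {
        VkSubmitInfo transferSubmit{};
        transferSubmit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
        transferSubmit.commandBufferCount = 1;
        transferSubmit.pCommandBuffers = &transferCmd;
        transferSubmit.signalSemaphoreCount = 1;
        transferSubmit.pSignalSemaphores = &transferDone;
        vkQueueSubmit(transferQueue, 1, &transferSubmit, VK_NULL_HANDLE);

        VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_VERTEX_INPUT_BIT;
        VkSubmitInfo drawSubmit{};
        drawSubmit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
        drawSubmit.waitSemaphoreCount = 1;
        drawSubmit.pWaitSemaphores = &transferDone;
        drawSubmit.pWaitDstStageMask = &waitStage;
        drawSubmit.commandBufferCount = 1;
        drawSubmit.pCommandBuffers = &drawCmd;
        vkQueueSubmit(graphicsQueue, 1, &drawSubmit, frameFence);
        // No vkQueueWaitIdle anywhere: the GPU orders the work via the semaphore.
    }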

What is a Hardware Semaphore?

How can it be used from software if it is a hardware semaphore? Is it that there is a software API which is actually implemented in HW?
I ask because I am implementing firmware to interface with some hardware, and there is going to be a lot of information exchange between the hardware and the firmware. I overheard talk of hardware semaphores and just wanted to find out more about them. Some literature on this would be helpful.
You are mostly correct: there is a software API that requires some special hardware support to work correctly. Software implementations of semaphores, of which there are a few, are all based on some sort of hardware instruction that is guaranteed to be atomic.
Atomicity in hardware is required to implement a semaphore; a plain read followed by a plain write is not atomic with respect to other processors or bus masters.
To elaborate somewhat: you implement a semaphore by reading and writing a piece of shared memory which is visible to more than one processor. Reading and writing that shared piece of memory is not an atomic operation in general; if you do a read followed by a write, other accesses can be scheduled between the read and the write.
The hardware needs to make sure the bus is locked so that other masters cannot access the resource before the second part of the operation (the write) takes place. Usually this is done at the arbitration stage in hardware.
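To show the software side of this, here is a minimal sketch (my own, not platform-specific) of a spinning counting semaphore whose only hardware requirement is an atomic compare-and-swap, i.e. a read and a conditional write performed as one indivisible operation:
    #include <atomic>

    // A spinning counting semaphore. The only "hardware magic" it relies on is
    // an atomic compare-and-swap, which performs the read and the conditional
    // write as one indivisible operation on the bus.
    class SpinSemaphore {
    public:
        explicit SpinSemaphore(int initial) : count_(initial) {}

        void acquire() {
            for (;;) {
                int current = count_.load(std::memory_order_relaxed);
                if (current > 0 &&
                    count_.compare_exchange_weak(current, current - 1,
                                                 std::memory_order_acquire)) {
                    return;  // we atomically took one unit
                }
                // either the count was zero or another thread beat us; spin
            }
        }

        void release() {
            count_.fetch_add(1, std::memory_order_release);
        }

    private:
        std::atomic<int> count_;
    };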
As an example of such a design: in a computer system having at least two processors, each with an associated memory and coupled to one another through an interface unit by means of a bus, hardware semaphores regulate access to shared resources. Each semaphore is one bit wide and can be written to obtain the desired state. When the semaphore is read, if its content is one, a one is returned; if its content is zero, a zero is returned and the semaphore is automatically set to one in the same operation.
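Firmware might use a register with exactly that behaviour as sketched below; the address, the register width, and the helper names are assumptions for illustration, not a real device:
    #include <cstdint>

    // Hypothetical memory-mapped hardware semaphore register, as described
    // above: reading it returns 0 (and the hardware atomically sets it to 1)
    // when it was free, or returns 1 when it is already taken. Writing 0
    // releases it. The address is a placeholder for your SoC's memory map.
    volatile std::uint32_t* const HW_SEM =
        reinterpret_cast<volatile std::uint32_t*>(0x40001000u);

    bool hw_sem_try_acquire() {
        return *HW_SEM == 0;  // the read itself acquires the semaphore if free
    }

    void hw_sem_acquire() {
        while (!hw_sem_try_acquire()) {
            // spin; real firmware would usually back off or yield here
        }
    }

    void hw_sem_release() {
        *HW_SEM = 0;
    }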

What advantages are there to programming for a non-cache-coherent multi-core machine?

What advantages are there to programming for a non-cache-coherent multi-core machine? Cache coherence has many benefits, but how would one take advantage of its opposite: an independent cache for each individual core. With what programming paradigm, and for what particular practical problems, would such an architecture be beneficial over a cache-coherent one?
You don't as such take advantage of cache non-coherence. You can't write code which relies on different cores having different views of memory, because a non-coherent cache doesn't guarantee to show different memory to different cores. It just reserves the right to do that.
Cache coherence costs circuits and time. Non-coherent caches are therefore cheaper (and cooler, perhaps?) and faster. Memory access might be faster in cycles, or might be the same best-case speed but with fewer stalls due to cache synchronisation and especially false sharing.
So it's not so much extra things you do to take advantage of non-coherence, it's the things that you don't have to do because you've dropped the disadvantages of coherence - you don't have to redesign your parallel code because it's spending all its time sitting around waiting for the result of a memory store from another core.
The downside of a non-coherent cache architecture at first appears to be that you find yourself adding synchronisation that coherent caches provide automatically. No double-checked locking for you. Then you realise that, in effect, coherent-cache architectures do this synchronisation (albeit in a super-fast hardware-implemented form) for every single memory access, and block if the cache line is dirty, whether you need it to or not. That cheers me right up :-)
What programming paradigm
Message passing.
and to what particular practical problems would such an architecture be beneficial over a cache-coherent one?
Pattern matching: the input block of memory can be treated as read-only, and each "output" result can be placed in a separate block waiting for a "reducer" of some sort (see the sketch below).
Of course, this is just an example amongst many I am sure.
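As a concrete (and invented) C++ sketch of that pattern: the workers below read a shared, read-only input and write only into their own output blocks, so nothing is written to memory that another core reads until the final reduce step.
    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <thread>
    #include <vector>

    // Each worker scans a disjoint slice of the (read-only) input and appends
    // its results to its own output block; a final "reducer" merges the blocks.
    // No worker writes memory that another worker reads, so no coherence
    // traffic (and no locks) are needed between them.
    std::vector<std::size_t> find_all(const std::string& text, char needle,
                                      unsigned workers = 4) {
        std::vector<std::vector<std::size_t>> partial(workers);  // one block per worker
        std::vector<std::thread> pool;
        const std::size_t chunk = (text.size() + workers - 1) / workers;

        for (unsigned w = 0; w < workers; ++w) {
            pool.emplace_back([&, w] {
                const std::size_t begin = std::min(text.size(), w * chunk);
                const std::size_t end = std::min(text.size(), begin + chunk);
                for (std::size_t i = begin; i < end; ++i)
                    if (text[i] == needle) partial[w].push_back(i);
            });
        }
        for (auto& t : pool) t.join();

        std::vector<std::size_t> result;  // reducer: concatenate blocks in order
        for (auto& p : partial)
            result.insert(result.end(), p.begin(), p.end());
        return result;
    }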
Just to make things clear: the principal reasons for going with "non-cache-coherent" architecture are cost & speed (assuming the problems at hand are more efficiently tackled using this architecture).
You can get a bit of extra performance, but you should never rely on each processor having different cache values, as you can never know when the cache will be flushed.
I'm not an expert, but I don't think it has any advantage over a cache-coherent architecture, apart from being simpler to implement. Of course, such simplicity can allow other optimizations that could be prohibitive in a more complex coherent system, making the non-coherent machine faster when carefully programmed.
That said, I concur with jldupont: message passing doesn't need coherency, so it's (almost) the mandatory way to do IPC.
You could think of the Cell SPE local memory as a sort of cache. It isn't cache really since it isn't automatic at all, but the speed is the same and it isn't coherent.
It has big speed advantages because the hardware does not need to spend any time synchronizing the cache line states between cores.
In a Cell, the programmer must do the synchronization manually by writing code to copy SPE local memory back and forth. So a disadvantage is much greater program complexity.
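The flavour of that manual copying, reduced to a sketch: the dma_* helpers below are hypothetical stand-ins for the platform's real DMA primitives and are stubbed with memcpy purely so the example is self-contained.
    #include <cstddef>
    #include <cstdint>
    #include <cstring>

    // Hypothetical DMA helpers: on real hardware these would start asynchronous
    // transfers between main memory and the core's local store.
    void dma_get(void* local, const void* remote, std::size_t size) { std::memcpy(local, remote, size); }
    void dma_put(void* remote, const void* local, std::size_t size) { std::memcpy(remote, local, size); }
    void dma_wait() { /* wait for outstanding transfers to complete */ }

    constexpr std::size_t kTile = 4096;
    alignas(128) static std::uint8_t local_buf[kTile];  // local-store working buffer

    // The programmer plays the role of the cache controller: pull a tile in,
    // work on it locally, push the results back out.
    void process_tile(std::uint8_t* remote_tile) {
        dma_get(local_buf, remote_tile, kTile);   // explicit "cache fill"
        dma_wait();
        for (std::size_t i = 0; i < kTile; ++i)   // compute entirely in local memory
            local_buf[i] = static_cast<std::uint8_t>(local_buf[i] ^ 0xFF);
        dma_put(remote_tile, local_buf, kTile);   // explicit "write-back"
        dma_wait();
    }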

What are some tricks that a processor does to optimize code?

I am looking for things like reordering of code that could even break the code in the case of multiple processors.
The most important one would be memory access reordering.
Absent memory fences or serializing instructions, the processor is free to reorder memory accesses. Some processor architectures have restrictions on how much they can reorder; Alpha is known for being the weakest (i.e., the one which can reorder the most).
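As a concrete illustration (my own example, not from the documentation mentioned below), the classic store-buffer litmus test can print r1=0 r2=0 when relaxed atomics are used, because each processor may let its load overtake its earlier store; sequentially consistent operations or explicit fences forbid that outcome:
    #include <atomic>
    #include <cstdio>
    #include <thread>

    // Store-buffer litmus test: with relaxed ordering the hardware (and the
    // compiler) may reorder the store and the load in each thread, so both r1
    // and r2 can end up 0 in the same run. Using memory_order_seq_cst, or
    // putting std::atomic_thread_fence(std::memory_order_seq_cst) between the
    // store and the load, rules that outcome out.
    std::atomic<int> x{0}, y{0};
    int r1, r2;

    int main() {
        std::thread t1([] {
            x.store(1, std::memory_order_relaxed);
            r1 = y.load(std::memory_order_relaxed);
        });
        std::thread t2([] {
            y.store(1, std::memory_order_relaxed);
            r2 = x.load(std::memory_order_relaxed);
        });
        t1.join();
        t2.join();
        std::printf("r1=%d r2=%d\n", r1, r2);  // r1==0 && r2==0 is a legal outcome
    }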
A very good treatment of the subject can be found in the Linux kernel source documentation, at Documentation/memory-barriers.txt.
Most of the time, it's best to use locking primitives from your compiler or standard library; these are well tested, should have all the necessary memory barriers in place, and are probably quite optimized (optimizing locking primitives is tricky; even the experts can get them wrong sometimes).
Wikipedia has a fairly comprehensive list of optimization techniques here.
Yes, but what exactly is your question?
However, since this is an interesting topic: the tricks that compilers and processors use to optimize code should not break code, even with multiple processors, in the absence of race conditions in that code. This is called the guarantee of sequential consistency for data-race-free programs: if your program has no race conditions and all shared data is correctly locked before it is accessed, the code behaves as if it were executed in some sequential interleaving of the threads.
There is a really good video of Herb Sutter talking about this here:
http://video.google.com/videoplay?docid=-4714369049736584770
Everyone should watch this :)
DavidK's answer is correct, but it is also very important to be aware of the memory model of your language/runtime. Even without race conditions, and with sequential consistency and mutex usage, your code can still break when data is being cached by different threads running on different cores of the CPU. Some languages, Java being one example, guarantee the visibility of data between threads when a mutex lock is used, but it is rarely enough simply to ensure that no two threads can access the data at the same time: you need to use the mutex in a way that makes the runtime synchronize the data state between the two threads. In Java this is done by having the two threads synchronize on the same object.
Here is a good page explaining the problem and how it's dealt with in Java's memory model.
http://gee.cs.oswego.edu/dl/cpj/jmm.html
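The same rule, written as a minimal C++ sketch rather than Java (my example; the principle is unchanged): the visibility of shared_value is only guaranteed because both threads lock the same mutex around it.
    #include <mutex>
    #include <thread>

    // Both sides must synchronize on the SAME mutex. A writer that locks and a
    // reader that doesn't (or that locks a different mutex) still race, and the
    // reader may see stale data.
    std::mutex m;
    int shared_value = 0;

    void writer() {
        std::lock_guard<std::mutex> lock(m);
        shared_value = 42;
    }

    int reader() {
        std::lock_guard<std::mutex> lock(m);  // same mutex as writer()
        return shared_value;
    }

    int main() {
        std::thread a(writer);
        std::thread b([] { (void)reader(); });  // sees 0 or 42, never a torn value
        a.join();
        b.join();
    }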

Techniques to Get rid of low level Locking

I'm wondering, and in need, of strategies that can be applied to reducing low-level locking.
However, the catch here is that this is not new code (there are tens of thousands of lines of C++) for a server application, so I can't just rewrite the whole thing.
I fear there might not be a solution to this problem by now (too late). However I'd like to hear about good patterns others have used.
Right now there are too many locks and not that many conflicts, so it's a paranoia-induced hardware performance issue.
The best way to describe the code is as single-threaded code that suddenly got peppered with locks.
Why do you need to eliminate the low-level locking? Do you have deadlock issues? Do you have performance problems? Or scaling issues? Are the locks generally contended or uncontended?
What environment are you using? The answers in C++ will be different to the ones in Java, for example. E.g. uncontended synchronization blocks in Java 6 are actually relatively cheap in performance terms, so simply upgrading your JRE might get you past whatever problem you are trying to solve. There might be similar performance boosts available in C++ by switching to a different compiler or locking library.
In general, there are several strategies that allow you to reduce the number of mutexes you acquire.
First, anything only ever accessed from a single thread doesn't need a mutex.
Second, anything immutable is safe provided it is 'safely published' (i.e. created in such a way that a partially constructed object is never visible to another thread).
Third, most platforms now support atomic writes - which can help when a single primitive type (including a pointer) is all that needs protecting. These work very similarly to optimistic locking in a database. You can also use atomic writes to create lock-free algorithms to replace more complex types, including Map implementations. However, unless you are very, very good, you are much better off borrowing somebody else's debugged implementation (the java.util.concurrent package contains lots of good examples) - it is notoriously easy to accidentally introduce bugs when writing your own algorithms.
Fourth, widening the scope of the mutex can help - either simply holding open a mutex for longer, rather than constantly locking and unlocking it, or taking a lock on a 'larger' item - the object rather than one of its properties, for example. However, this has to be done extremely carefully; you can easily introduce problems this way.
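To sketch the second and third strategies above in C++ (my own example; reclamation of the old snapshot is deliberately omitted and must be solved in production with hazard pointers, RCU, reference counting, or similar):
    #include <atomic>
    #include <string>

    // An immutable snapshot that is "safely published" through a single atomic
    // pointer. Readers never take a mutex.
    struct Config {
        std::string endpoint;
        int timeout_ms;
    };

    std::atomic<const Config*> g_config{new Config{"localhost:8080", 500}};

    const Config* current_config() {
        return g_config.load(std::memory_order_acquire);  // lock-free read
    }

    void update_config(std::string endpoint, int timeout_ms) {
        // Build the new immutable object completely, then publish it atomically.
        const Config* next = new Config{std::move(endpoint), timeout_ms};
        g_config.store(next, std::memory_order_release);
        // NOTE: the previous Config* is leaked here; a real system must reclaim
        // it safely once no reader can still hold a pointer to it.
    }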
The threading model of your program has to be decided before a single line is written. Any module, if inconsistent with the rest of the program, can crash, corrupt, or deadlock the application.
If you have the luxury of starting fresh, try to identify large functions of your program that can be done in parallel and use a thread pool to schedule the tasks. The trick to efficiency is to avoid mutexes wherever possible and (re)code your app to avoid contention for resources at a high level.
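A small sketch of that approach, with std::async standing in for a real thread pool: each task owns its slice of the input and produces its own partial result, so no mutex is needed until the final, trivial reduction.
    #include <algorithm>
    #include <cstddef>
    #include <future>
    #include <numeric>
    #include <vector>

    // Partition the work so tasks never contend for shared state; combine the
    // per-task results once, after all tasks have finished.
    long long parallel_sum(const std::vector<int>& data, unsigned tasks = 4) {
        std::vector<std::future<long long>> parts;
        const std::size_t chunk = (data.size() + tasks - 1) / tasks;

        for (unsigned t = 0; t < tasks; ++t) {
            const std::size_t begin = std::min(data.size(), t * chunk);
            const std::size_t end = std::min(data.size(), begin + chunk);
            parts.push_back(std::async(std::launch::async, [&data, begin, end] {
                return std::accumulate(data.begin() + begin, data.begin() + end, 0LL);
            }));
        }

        long long total = 0;  // reduction happens once, after all tasks finish
        for (auto& p : parts) total += p.get();
        return total;
    }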
You may find some of the answers here and here helpful as you look for ways to atomically update shared state without explicit locks.