Following on from a discussion which got going in the comments of this question.
How would one go about writing a Spinlock without CAS operations?
As the other question states:
The memory ordering model is such that writes will be atomic (if two concurrent threads write a memory location at the same time, the result will be one or the other). The platform will not support atomic compare-and-set operations.
Wikipedia's article on spinlocks says you'll have to use an algorithm like Peterson's algorithm, which uses a turn flag, in addition to each process's interest flag, to indicate whose turn it is to enter the critical section.
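For concreteness, here is a minimal sketch of Peterson's algorithm for two threads in C++ (the class and member names are illustrative). It relies only on atomic loads and stores with sequentially consistent ordering, so the flag and turn accesses are not reordered; no compare-and-swap is needed.

```cpp
#include <atomic>

// Peterson's lock for exactly two threads (ids 0 and 1).
// Uses only atomic loads and stores; no compare-and-swap.
class PetersonLock {
    std::atomic<bool> interested[2];  // interested[i]: thread i wants the lock
    std::atomic<int>  turn;           // which thread yields if both are interested

public:
    PetersonLock() : turn(0) {
        interested[0].store(false);
        interested[1].store(false);
    }

    void lock(int id) {               // id must be 0 or 1
        int other = 1 - id;
        interested[id].store(true);
        turn.store(other);            // politely let the other thread go first
        // Spin while the other thread is interested and it is its turn.
        while (interested[other].load() && turn.load() == other) {
            // busy-wait
        }
    }

    void unlock(int id) {
        interested[id].store(false);
    }
};
```

Thread 0 calls lock(0)/unlock(0) and thread 1 calls lock(1)/unlock(1); for more than two threads, the filter lock or Lamport's bakery algorithm generalizes the same idea, again without CAS.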
I am currently studying the specifications for RISC-V with specification version 2.2 and Privileged Architecture version 1.10. In Chapter 2 of RISC-V specification, it is mentioned that "[...] though a simple implementation might cover the eight SCALL/SBREAK/CSRR* instructions with a single SYSTEM hardware instruction that always traps [...]"
However, when I look at the privileged specification, MRET is also a SYSTEM instruction, and it is required to return from a trap. Right now I am confused about how much of the Machine-level ISA is required: is it possible to omit all M-level CSRs and use a software handler for any SYSTEM instruction, as stated in the specification? If so, how does one pass information such as the return address and trap cause? Is it done through the regular registers x1-x31?
Alternatively, is it enough to implement only the following M-level CSRs, if I am aiming for a simple embedded core with only M-level privilege?
mvendorid
marchid
mimpid
mhartid
misa
mscratch
mepc
mcause
Finally, how many of these CSRs can be omitted?
ECALL/EBREAK instructions are traps anyway. CSR instructions need to be carefully parsed to make sure they specify existent registers being accessed in allowed modes, which sounds like a job for your favorite sparse matrix, whether PLA or if/then.
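To make the if/then flavor concrete, here is a hypothetical sketch of how a trap handler might classify a trapped SYSTEM instruction before deciding whether to emulate it or raise an illegal-instruction exception. The enum and function names are illustrative; the field encodings are taken from the base ISA encoding.

```cpp
#include <cstdint>

// Hypothetical classifier for a trapped SYSTEM instruction (opcode 0x73).
enum class SysOp { Ecall, Ebreak, Mret, Wfi,
                   Csrrw, Csrrs, Csrrc, Csrrwi, Csrrsi, Csrrci, Illegal };

SysOp decode_system(uint32_t insn) {
    if ((insn & 0x7f) != 0x73) return SysOp::Illegal;  // not a SYSTEM opcode
    uint32_t funct3 = (insn >> 12) & 0x7;
    uint32_t imm12  = insn >> 20;                      // CSR address / funct12

    switch (funct3) {
        case 0:                                        // "privileged" encodings
            switch (imm12) {
                case 0x000: return SysOp::Ecall;
                case 0x001: return SysOp::Ebreak;
                case 0x105: return SysOp::Wfi;
                case 0x302: return SysOp::Mret;
                default:    return SysOp::Illegal;
            }
        case 1: return SysOp::Csrrw;
        case 2: return SysOp::Csrrs;
        case 3: return SysOp::Csrrc;
        case 5: return SysOp::Csrrwi;
        case 6: return SysOp::Csrrsi;
        case 7: return SysOp::Csrrci;
        default: return SysOp::Illegal;
    }
}

// For the CSR ops the handler must still check that imm12 names an implemented
// CSR, that it is not a read-only address (bits [11:10] == 0b11) being written,
// and that the current privilege level is at least CSR address bits [9:8].
```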
You could emulate all SYSTEM instructions, but, as you see, you need to be able to access information inside the hardware that is not part of the normal ISA. This implies that you need to add "instruction extensions."
I would also recommend making the SYSTEM instructions atomic, meaning that exceptions should be masked or avoided within each emulated instruction.
Since I am not a very trusting person, I would create a new mode that would enable the instruction extensions that would allow you to read the exception address directly from the hardware, for example, and fetch instructions from a protected area of memory. Interrupts would be disabled automatically. The mode would be exited by branching to epc+4 or the illegal instruction handler. I would not want to have anything outside the RISC-V spec available even in M-mode, just to be safe.
In my experience, it is better to say "I do everything," than it is to explain to each customer, or worse, have a competitor explain to your customers, what it is that you do not do. But perhaps someone who knows the CSRs better could help; it is not something I do.
In some articles about algorithms, some use the word lock-free and some use lockless. What's the difference between lockless and lock-free? Thanks!
Update
http://www.intel.com/content/dam/www/public/us/en/documents/guides/intel-dpdk-programmers-guide.pdf
Section 5.2, "Lockless Ring Buffer in Linux*", is an example of the use of the word "lockless".
An algorithm is lock-free if it guarantees that, when the program's threads are run sufficiently long, at least one of the threads makes progress (for some sensible definition of progress). All wait-free algorithms are lock-free.
In general, a lock-free algorithm can run in four phases: completing one's own operation, assisting an obstructing operation, aborting an obstructing operation, and waiting. Completing one's own operation is complicated by the possibility of concurrent assistance and abortion, but it is invariably the fastest path to completion (see non-blocking algorithms).
Lockless programming is a set of techniques for safely manipulating shared data without using locks. There are lockless algorithms available for passing messages, sharing lists and queues of data, and other tasks; for example, all purely functional data structures are inherently lock-free, since they are immutable. Lockless programming is pretty complicated.
Lock-free is a more formal thing (look for lock-free algorithms). The essence of it for data structures is that if two threads/processes access the data structure and one of them dies, the other one is still guaranteed to complete the operation.
Lockless is about implementation - it means the algorithm does not use locks (or, to use the more formal name, mutual exclusion).
Therefore a lock-free algorithm is also lockless (because if one thread took a lock and then died, the others would wait forever), but not the other way around: there are algorithms which don't use locks (e.g. they use compare-and-swap) but can still hang if another process dies. The DPDK ring buffer mentioned above is an example of a lockless algorithm that is not lock-free.
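To make the distinction concrete, here is a small C++ sketch (purely illustrative, not taken from DPDK). The first function is lock-free: a failed CAS always means some other thread succeeded. The second pattern uses no locks either, but it is not lock-free, because a publisher that dies between its two steps leaves consumers spinning forever.

```cpp
#include <atomic>

// Lock-free increment: no locks, and at least one thread always makes
// progress -- if one thread's CAS fails, it is because another succeeded.
void lock_free_increment(std::atomic<int>& counter) {
    int old_value = counter.load();
    while (!counter.compare_exchange_weak(old_value, old_value + 1)) {
        // old_value was refreshed by compare_exchange_weak; just retry.
    }
}

// Lockless but NOT lock-free (illustrative): a two-step publish protocol.
// No mutex is used, but if the publisher dies between its two stores,
// consumers spin forever -- the same caveat as the ring-buffer case above.
struct Slot {
    std::atomic<bool> claimed{false};
    std::atomic<bool> ready{false};
    int payload = 0;
};

bool try_publish(Slot& s, int value) {
    bool expected = false;
    if (!s.claimed.compare_exchange_strong(expected, true))
        return false;              // someone else owns the slot
    s.payload = value;             // crash here and consume() below hangs
    s.ready.store(true, std::memory_order_release);
    return true;
}

int consume(Slot& s) {
    while (!s.ready.load(std::memory_order_acquire)) {
        // busy-wait: progress depends on the publisher finishing its update
    }
    return s.payload;
}
```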
TL;DR version: "What's the best way to round-robin kernel calls to multiple GPUs with Python/PyCUDA such that CPU and GPU work can happen in parallel?" with a side of "I can't have been the first person to ask this; anything I should read up on?"
Full version:
I would like to know the best way to design context, etc. handling in an application that uses CUDA on a system with multiple GPUs. I've been trying to find literature that talks about guidelines for when context reuse vs. recreation is appropriate, but so far haven't found anything that outlines best practices, rules of thumb, etc.
The general overview of what we need to do is:
Requests come in to a central process.
That process forks to handle a single request.
Data is loaded from the DB (relatively expensive).
The following is repeated an arbitrary number of times based on the request (dozens of times):
A few quick kernel calls to compute data that is needed for later kernels.
One slow kernel call (10 sec).
Finally:
Results from the kernel calls are collected and processed on the CPU, then stored.
At the moment, each kernel call creates and then destroys a context, which seems wasteful. Setup is taking about 0.1 sec per context and kernel load, and while that's not huge, it is precluding us from moving other quicker tasks to the GPU.
I am trying to figure out the best way to manage contexts, etc., so that we can use the machine efficiently. I think that in the single-GPU case, it's relatively simple:
Create a context before starting any of the GPU work.
Launch the kernels for the first set of data.
Record an event for after the final kernel call in the series.
Prepare the second set of data on the CPU while the first is computing on the GPU.
Launch the second set, repeat.
Ensure that each event gets synchronized before collecting the results and storing them.
That seems like it should do the trick, assuming proper use of overlapped memory copies.
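For what it's worth, here is a rough sketch of that single-GPU pattern using the CUDA runtime API (PyCUDA exposes equivalent Stream and Event objects). The kernels, sizes, and buffer handling are placeholders for the application's real work, and pinned host memory would be needed for copies to overlap fully.

```cpp
#include <cuda_runtime.h>
#include <vector>

__global__ void quick_kernel(float* buf, int n) { /* fast preparatory work */ }
__global__ void slow_kernel(float* buf, int n)  { /* the ~10 s kernel */ }

// Process num_items work items on one GPU, overlapping CPU-side preparation
// of item i+1 with GPU execution of item i.
void run_pipeline(int num_items, int n) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    std::vector<float*> d_buf(num_items);
    std::vector<cudaEvent_t> done(num_items);

    for (int i = 0; i < num_items; ++i) {
        cudaMalloc(&d_buf[i], n * sizeof(float));
        cudaEventCreate(&done[i]);

        // ... CPU-side preparation of item i happens here, while the GPU is
        // still busy with item i-1's kernels queued below ...

        quick_kernel<<<256, 256, 0, stream>>>(d_buf[i], n);
        slow_kernel <<<256, 256, 0, stream>>>(d_buf[i], n);
        cudaEventRecord(done[i], stream);    // "event after the final kernel call"
        // Launches are asynchronous: control returns to the CPU immediately.
    }

    for (int i = 0; i < num_items; ++i) {
        cudaEventSynchronize(done[i]);       // block only when collecting results
        // ... copy back / post-process item i on the CPU, then store it ...
        cudaEventDestroy(done[i]);
        cudaFree(d_buf[i]);
    }
    cudaStreamDestroy(stream);
}
```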
However, I'm unsure what I should do when wanting to round-robin each of the dozens of items to process over multiple GPUs.
The host program is Python 2.7, using PyCUDA to access the GPU. Currently it's not multi-threaded, and while I'd rather keep it that way ("now you have two problems" etc.), if the answer means threads, it means threads. Similarly, it would be nice to just be able to call event.synchronize() in the main thread when it's time to block on data, but for our needs efficient use of the hardware is more important. Since we'll potentially be servicing multiple requests at a time, letting other processes use the GPU when this process isn't using it is important.
I don't think that we have any explicit reason to use exclusive compute modes (i.e., we're not filling up the memory of the card with one work item), so I don't think that solutions that involve long-standing contexts are off the table.
Note that answers in the form of links to other content that covers my questions are completely acceptable (encouraged, even), provided they go into enough detail about the why, not just the API. Thanks for reading!
Caveat: I'm not a PyCUDA user (yet).
With CUDA 4.0+ you don't even need an explicit context per GPU. You can just call cudaSetDevice (or the PyCUDA equivalent) before doing per-device stuff (cudaMalloc, cudaMemcpy, launch kernels, etc.).
If you need to synchronize between GPUs, you will potentially need to create streams and/or events and use cudaEventSynchronize (or the PyCUDA equivalent). You can even have one stream wait on an event inserted in another stream to build sophisticated dependencies.
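A hypothetical sketch of that round-robin in CUDA runtime terms (PyCUDA's Device/Stream/Event objects map onto the same calls); kernel names and sizes are placeholders, and in real code an async copy of each item's results back to the host would precede its event.

```cpp
#include <cuda_runtime.h>
#include <vector>

__global__ void work_kernel(float* buf, int n) { /* the real per-item work */ }

int main() {
    int num_devices = 0;
    cudaGetDeviceCount(&num_devices);
    if (num_devices == 0) return 1;

    const int n = 1 << 20;                    // placeholder problem size
    const int num_items = 24;                 // "dozens" of items per request

    // One stream and one scratch buffer per device; the runtime keeps one
    // context per device per process, so nothing is created/destroyed per item.
    std::vector<cudaStream_t> streams(num_devices);
    std::vector<float*> d_buf(num_devices);
    for (int d = 0; d < num_devices; ++d) {
        cudaSetDevice(d);
        cudaStreamCreate(&streams[d]);
        cudaMalloc(&d_buf[d], n * sizeof(float));
    }

    std::vector<cudaEvent_t> done(num_items);
    for (int i = 0; i < num_items; ++i) {
        int d = i % num_devices;              // round-robin assignment
        cudaSetDevice(d);                     // subsequent calls target device d
        cudaEventCreate(&done[i]);
        work_kernel<<<256, 256, 0, streams[d]>>>(d_buf[d], n);
        cudaEventRecord(done[i], streams[d]);
        // Launch is asynchronous; the CPU keeps feeding the other devices.
    }

    for (int i = 0; i < num_items; ++i) {
        cudaEventSynchronize(done[i]);        // block only when results are needed
        cudaEventDestroy(done[i]);
    }

    for (int d = 0; d < num_devices; ++d) {
        cudaSetDevice(d);
        cudaFree(d_buf[d]);
        cudaStreamDestroy(streams[d]);
    }
    return 0;
}
```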
So I suspect the answer today is quite a lot simpler than talonmies' excellent pre-CUDA-4.0 answer.
You might also find this answer useful.
(Re)Edit by OP: Per my understanding, PyCUDA supports versions of CUDA prior to 4.0, and so still uses the old API/semantics (the driver API?), so talonmies' answer is still relevant.
I have gone through many forum posts and the NVIDIA documentation, but I couldn't understand what __threadfence() does and how to use it. Could someone explain what the purpose of that intrinsic is?
Normally, there is no guarantee that if one block writes something to global memory, another block will "see" it. There is also no guarantee regarding the ordering of writes to global memory, except from the perspective of the block that issued them.
There are two exceptions:
atomic operations - these are always visible to other blocks
threadfence
Imagine that one block produces some data and then uses an atomic operation to set a flag indicating that the data is there. But it is possible that the other block, after seeing the flag, still reads incorrect or incomplete data.
This is where the __threadfence function comes to the rescue: it enforces the ordering. All writes before it really happen before all writes after it, as seen from other blocks.
Note that the __threadfence function doesn't necessarily need to stall the current thread until its writes to global memory are visible to all other threads in the grid. Implemented in this naive way, the __threadfence function could hurt performance severely.
As an example, if you do something like:
store your data
__threadfence()
atomically mark a flag
it is guaranteed that if the other block sees the flag, it will also see the data.
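A minimal CUDA sketch of that three-step pattern (the kernel, values, and grid size are illustrative; the small grid keeps all blocks co-resident so the spin loop can terminate):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

__device__ volatile int d_data = 0;  // the payload
__device__ int d_flag = 0;           // the "data is ready" flag

__global__ void producer_consumer() {
    if (blockIdx.x == 0) {
        if (threadIdx.x == 0) {
            d_data = 42;             // 1. store the data
            __threadfence();         // 2. make the store visible device-wide first
            atomicExch(&d_flag, 1);  // 3. then atomically publish the flag
        }
    } else if (threadIdx.x == 0) {
        while (atomicAdd(&d_flag, 0) == 0) { }  // spin until the flag is set
        // Because of the producer's __threadfence, seeing flag == 1 implies
        // seeing d_data == 42, never the stale 0.
        printf("block %d sees data = %d\n", blockIdx.x, d_data);
    }
}

int main() {
    producer_consumer<<<4, 32>>>();  // few blocks: all are resident at once
    cudaDeviceSynchronize();
    return 0;
}
```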
Further reading: CUDA Programming Guide, Section B.5 (as of version 11.5)
I am looking for things like reordering of code that could even break the code in the case of multiple processors.
The most important one would be memory access reordering.
Absent memory fences or serializing instructions, the processor is free to reorder memory accesses. Some processor architectures have restrictions on how much they can reorder; Alpha is known for being the weakest (i.e., the one which can reorder the most).
A very good treatment of the subject can be found in the Linux kernel source documentation, at Documentation/memory-barriers.txt.
Most of the time, it's best to use locking primitives from your compiler or standard library; these are well tested, should have all the necessary memory barriers in place, and are probably quite optimized (optimizing locking primitives is tricky; even the experts can get them wrong sometimes).
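As a small illustration in C++ (not taken from the kernel documentation): publishing a value through a flag needs release/acquire ordering, which is exactly the kind of barrier that a proper locking primitive supplies for you on lock and unlock. The names below are just examples.

```cpp
#include <atomic>
#include <thread>
#include <cassert>

int payload = 0;
std::atomic<bool> ready{false};

void producer() {
    payload = 42;
    // Release store: the payload write may not be reordered after this store.
    ready.store(true, std::memory_order_release);
}

void consumer() {
    // Acquire load: reads after it may not be reordered before it.
    while (!ready.load(std::memory_order_acquire)) { }
    assert(payload == 42);   // guaranteed; with plain non-atomic accesses it is not
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}
```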
Wikipedia has a fairly comprehensive list of optimization techniques here.
Yes, but what exactly is your question?
However, since this is an interesting topic: tricks that compilers and processors use to optimize code should not break that code, even with multiple processors, in the absence of race conditions in the code. This is called the guarantee of sequential consistency: if your program does not have any race conditions, and all data is correctly locked before being accessed, the code will behave as if it were executed sequentially.
There is a really good video of Herb Sutter talking about this here:
http://video.google.com/videoplay?docid=-4714369049736584770
Everyone should watch this :)
DavidK's answer is correct; however, it is also very important to be aware of the memory model for your language/runtime. Even without race conditions, and with sequential consistency and mutex usage, your code can still break when data is being cached by different threads running on different cores of the CPU. Some languages, Java being one example, ensure the state of data between threads when a mutex lock is used, but it is rarely enough to simply ensure that no two threads can access the data at the same time. You need to use the mutex in a correct way to ensure that the language runtime synchronizes the data state between the two threads. In Java, this is done by having the two threads synchronize on the same object.
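The C++ analogue of that Java rule, as a sketch: both the writing and the reading thread must use the same lock, because it is the unlock/lock pairing (not mere mutual exclusion) that forces the writer's changes to become visible. The names here are illustrative.

```cpp
#include <mutex>

std::mutex m;
int shared_state = 0;

void writer() {
    std::lock_guard<std::mutex> guard(m);
    shared_state = 42;
}   // releasing the mutex publishes the write

int reader() {
    std::lock_guard<std::mutex> guard(m);   // acquiring the same mutex pulls in
    return shared_state;                    // the writer's changes
}
```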
Here is a good page explaining the problem and how it's dealt with in Java's memory model.
http://gee.cs.oswego.edu/dl/cpj/jmm.html