What are some tricks that a processor does to optimize code? - language-agnostic

I am looking for things like reordering of code that could even break the code on a multiprocessor system.

The most important one would be memory access reordering.
Absent memory fences or serializing instructions, the processor is free to reorder memory accesses. Some processor architectures have restrictions on how much they can reorder; Alpha is known for being the weakest (i.e., the one which can reorder the most).
A very good treatment of the subject can be found in the Linux kernel source documentation, at Documentation/memory-barriers.txt.
Most of the time, it's best to use locking primitives from your compiler or standard library; these are well tested, should have all the necessary memory barriers in place, and are probably quite optimized (optimizing locking primitives is tricky; even the experts can get them wrong sometimes).
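To make the reordering concern concrete, here is a minimal C++ sketch (not from the kernel documentation) of publishing data to another thread with a release/acquire pair from the standard library. The names payload, ready, writer, reader and publish_and_read are made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names for illustration only.
int payload = 0;                  // plain, non-atomic data
std::atomic<bool> ready{false};   // flag that publishes the data

// Writer: store the data, then set the flag with release ordering, so the
// store to payload cannot be reordered after the store to the flag.
void writer() {
    payload = 42;
    ready.store(true, std::memory_order_release);
}

// Reader: spin until the flag is observed with acquire ordering. Loads that
// follow cannot be reordered before it, so payload is visible afterwards.
int reader() {
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return payload;
}

// Runs one writer and one reader thread and returns what the reader saw.
int publish_and_read() {
    payload = 0;
    ready.store(false);
    int seen = -1;
    std::thread r([&seen] { seen = reader(); });
    std::thread w(writer);
    w.join();
    r.join();
    assert(seen == 42);  // holds because of the release/acquire pairing
    return seen;
}
```

If both memory orders were relaxed, nothing would stop the compiler or a weakly ordered processor from making the flag visible before the payload, and the reader would have a data race on payload.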

Wikipedia has a fairly comprehensive list of optimization techniques here.

Yes, but what exactly is your question?
However, since this is an interesting topic: tricks that compilers and processors use to optimize code should not break code, even with multiple processors, in the absence of race conditions in that code. This guarantee is usually described as sequential consistency for data-race-free programs: if your program does not have any race conditions, and all shared data is correctly locked before it is accessed, the code will behave as if it were executed sequentially.
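As a small illustration of "correctly locked" data, here is a C++ sketch in which every access to a shared counter goes through the same mutex. The function name count_with_lock is hypothetical.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: increments a shared counter from several threads,
// taking the same mutex around every single access.
long count_with_lock(int num_threads, int increments_per_thread) {
    assert(num_threads > 0 && increments_per_thread >= 0);
    long counter = 0;
    std::mutex m;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < increments_per_thread; ++i) {
                std::lock_guard<std::mutex> guard(m);  // no access without the lock
                ++counter;
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

Because there is no data race, the result is the same as if the increments had run one after another, whatever reordering the compiler or processor performs underneath.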
There is a really good video of Herb Sutter talking about this here:
http://video.google.com/videoplay?docid=-4714369049736584770
Everyone should watch this :)

DavidK's answer is correct; however, it is also very important to be aware of the memory model for your language/runtime. Even without race conditions and with sequential consistency and mutex usage, your code can still break when data is cached by different threads running on different cores of the CPU. Some languages (Java is one example) guarantee the state of data between threads when a mutex lock is used, but it is rarely enough to simply ensure that no two threads can access the data at the same time. You need to use the mutex correctly so that the language runtime synchronizes the data state between the two threads. In Java this is done by having the two threads synchronize on the same object.
Here is a good page explaining the problem and how it's dealt with in Java's memory model.
http://gee.cs.oswego.edu/dl/cpj/jmm.html


Why are CUDA indices 2D? [duplicate]

I have basically the same question as posed in this discussion. In particular I want to refer to this final response:
I think there are two different questions mixed together in this thread:
Is there a performance benefit to using a 2D or 3D mapping of input or output data to threads? The answer is "absolutely" for all the reasons you and others have described. If the data or calculation has spatial locality, then so should the assignment of work to threads in a warp.
Is there a performance benefit to using CUDA's multidimensional grids to do this work assignment? In this case, I don't think so since you can do the index calculation trivially yourself at the top of the kernel. This burns a few arithmetic instructions, but that should be negligible compared to the kernel launch overhead.
This is why I think the multidimensional grids are intended as a programmer convenience rather than a way to improve performance. You do absolutely need to think about each warp's memory access patterns, though.
I want to know if this situation still holds today. I want to know the reason why there is a need for a multidimensional "outer" grid.
What I'm trying to understand is whether or not there is a significant purpose to this (e.g. an actual benefit from spatial locality) or is it there for convenience (e.g. in an image processing context, is it there only so that we can have CUDA be aware of the x/y "patch" that a particular block is processing so it can report it to the CUDA Visual Profiler or something)?
A third option is that this is nothing more than a holdover from earlier versions of CUDA where it was a workaround for hardware indexing limits.
There is definitely a benefit to using a multi-dimensional grid. The different entries (tid, ctaid) are read-only variables visible as special registers. See the PTX ISA:
PTX includes a number of predefined, read-only variables, which are visible as special registers and accessed through mov or cvt instructions.
The special registers are:
%tid
%ntid
%laneid
%warpid
%nwarpid
%ctaid
%nctaid
If some of these values can be used without further processing, you not only save arithmetic instructions, potentially at each indexing step of multi-dimensional data, but more importantly you save registers, which are a very scarce resource on any hardware.
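For reference, the manual index calculation that the quoted response calls trivial amounts to roughly the following. This is plain host-side C++ standing in for what would sit at the top of a kernel; Coord2D, unflatten and flatten are hypothetical names, not CUDA types or functions.

```cpp
#include <cassert>

// Hypothetical struct for illustration; not a CUDA type.
struct Coord2D {
    int x;
    int y;
};

// Recover 2D coordinates from a flat (1D) index over a row-major grid
// that is `width` elements wide: one modulo and one division.
Coord2D unflatten(int linear_index, int width) {
    assert(width > 0 && linear_index >= 0);
    return Coord2D{linear_index % width, linear_index / width};
}

// Inverse mapping: flatten (x, y) back to a row-major 1D index.
int flatten(Coord2D c, int width) {
    assert(width > 0);
    return c.y * width + c.x;
}
```

The division, the modulo and the temporaries that hold their results are what a thread avoids when it can read the x and y components of the built-in indices directly.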

CUDA contexts, streams, and events on multiple GPUs

TL;DR version: "What's the best way to round-robin kernel calls to multiple GPUs with Python/PyCUDA such that CPU and GPU work can happen in parallel?" with a side of "I can't have been the first person to ask this; anything I should read up on?"
Full version:
I would like to know the best way to design context, etc. handling in an application that uses CUDA on a system with multiple GPUs. I've been trying to find literature that talks about guidelines for when context reuse vs. recreation is appropriate, but so far haven't found anything that outlines best practices, rules of thumb, etc.
The general overview of what we're needing to do is:
Requests come in to a central process.
That process forks to handle a single request.
Data is loaded from the DB (relatively expensive).
Then the following is repeated an arbitrary number of times based on the request (dozens):
A few quick kernel calls to compute data that is needed for later kernels.
One slow kernel call (10 sec).
Finally:
Results from the kernel calls are collected and processed on the CPU, then stored.
At the moment, each kernel call creates and then destroys a context, which seems wasteful. Setup is taking about 0.1 sec per context and kernel load, and while that's not huge, it is precluding us from moving other quicker tasks to the GPU.
I am trying to figure out the best way to manage contexts, etc. so that we can use the machine efficiently. I think that in the single-GPU case, it's relatively simple:
Create a context before starting any of the GPU work.
Launch the kernels for the first set of data.
Record an event for after the final kernel call in the series.
Prepare the second set of data on the CPU while the first is computing on the GPU.
Launch the second set, repeat.
Ensure that each event gets synchronized before collecting the results and storing them.
That seems like it should do the trick, assuming proper use of overlapped memory copies.
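The overlap pattern in the steps above can be sketched without any GPU at all. In this C++ analogy, std::async stands in for an asynchronous kernel launch and future.get() stands in for event.synchronize(); slow_kernel, prepare_batch and run_pipeline are hypothetical stand-ins, not PyCUDA or CUDA calls.

```cpp
#include <cassert>
#include <future>
#include <utility>
#include <vector>

// Hypothetical stand-in for the slow GPU kernel: sums one prepared batch.
long slow_kernel(std::vector<int> batch) {
    long sum = 0;
    for (int v : batch) sum += v;
    return sum;
}

// Hypothetical stand-in for CPU-side preparation of batch number i.
std::vector<int> prepare_batch(int i) {
    return std::vector<int>(100, i);  // 100 copies of i
}

// Launch batch i asynchronously, prepare batch i+1 on the calling thread
// while it runs, and block on the results only at the very end.
long run_pipeline(int num_batches) {
    assert(num_batches >= 0);
    if (num_batches == 0) return 0;
    std::vector<std::future<long>> pending;
    std::vector<int> next = prepare_batch(0);
    for (int i = 0; i < num_batches; ++i) {
        // "Launch": returns immediately, the work proceeds in the background.
        pending.push_back(
            std::async(std::launch::async, slow_kernel, std::move(next)));
        // Overlap: prepare the following batch while the launch is running.
        if (i + 1 < num_batches) next = prepare_batch(i + 1);
    }
    long total = 0;
    for (auto& f : pending) total += f.get();  // "synchronize" on each result
    return total;
}
```

The shape carries over to the real thing: nothing in the loop waits for the previous launch, and all blocking is deferred to the collection step.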
However, I'm unsure what I should do when wanting to round-robin each of the dozens of items to process over multiple GPUs.
The host program is Python 2.7, using PyCUDA to access the GPU. Currently it's not multi-threaded, and while I'd rather keep it that way ("now you have two problems" etc.), if the answer means threads, it means threads. Similarly, it would be nice to just be able to call event.synchronize() in the main thread when it's time to block on data, but for our needs efficient use of the hardware is more important. Since we'll potentially be servicing multiple requests at a time, letting other processes use the GPU when this process isn't using it is important.
I don't think that we have any explicit reason to use Exclusive compute modes (i.e., we're not filling up the memory of the card with one work item), so I don't think that solutions that involve long-standing contexts are off the table.
Note that answers in the form of links to other content that covers my questions are completely acceptable (encouraged, even), provided they go into enough detail about the why, not just the API. Thanks for reading!
Caveat: I'm not a PyCUDA user (yet).
With CUDA 4.0+ you don't even need an explicit context per GPU. You can just call cudaSetDevice (or the PyCUDA equivalent) before doing per-device stuff (cudaMalloc, cudaMemcpy, launch kernels, etc.).
If you need to synchronize between GPUs, you may need to create streams and/or events and use cudaEventSynchronize (or the PyCUDA equivalent). You can even have one stream wait on an event inserted in another stream (cudaStreamWaitEvent) to express sophisticated dependencies.
So I suspect the answer today is quite a lot simpler than talonmies' excellent pre-CUDA-4.0 answer.
You might also find this answer useful.
(Re)Edit by OP: Per my understanding, PyCUDA supports versions of CUDA prior to 4.0, and so still uses the old API/semantics (the driver API?), so talonmies' answer is still relevant.

CUDA: Stop all other threads

I have a problem that is seemingly solvable by enumerating all possible solutions and then finding the best. In order to do so, I devised a backtracking algorithm that enumerates and stores the best solution if found. It works fine so far.
Now, I wanted to port this algorithm to CUDA. Therefore, I created a procedure that generates some distinct basic cases. These basic cases should be processed in parallel on the GPU. If one of the CUDA-threads finds an optimal solution, all the other threads can - of course - stop their work.
So, what I want is roughly the following: The thread that finds the optimal solution should stop all running CUDA-threads of my program, thus finishing calculation.
After some quick search, I found that threads can only communicate if they are in the same block. (So I suppose it's impossible to stop other blocks' threads.)
The only method I could think of is that I have a dedicated flag optimum_found, which is checked at the beginning of every kernel. If an optimum solution is found, this flag is set to 1, so all future threads know that they do not have to work. But of course, threads already running do not notice this flag if they do not check it at every iteration.
So, is there a possibility to stop all remaining CUDA-threads?
I think that your method of having a dedicated flag could work provided that it was a memory location in global memory. That way you can check this, as you said, at the beginning of each kernel call.
Kernel calls should generally be relatively short anyways, therefore letting the other threads in a batch finish even though an optimal solution was found by one of those threads shouldn't affect your performance too much.
That said, I am fairly sure there is no CUDA call that can kill off other actively executing threads.
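The flag approach is cooperative cancellation: nothing is killed, every worker just polls the flag and returns early. Here is a CPU-thread sketch of the idea in C++, with std::atomic<bool> standing in for a flag in global memory. The name search_with_stop_flag and the toy "search for a target value" are assumptions for illustration; in a real kernel the flag would live in device memory and be polled inside the backtracking loop.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Each worker scans its own slice of candidates for `target` (a stand-in
// for "the optimal solution") and checks the shared flag on every iteration.
int search_with_stop_flag(int num_workers, int per_worker, int target) {
    assert(num_workers > 0 && per_worker > 0);
    std::atomic<bool> optimum_found{false};
    std::atomic<int> winner{-1};
    std::vector<std::thread> workers;
    for (int w = 0; w < num_workers; ++w) {
        workers.emplace_back([&, w] {
            const int begin = w * per_worker;
            for (int c = begin; c < begin + per_worker; ++c) {
                // Polling per iteration lets running workers stop early too.
                if (optimum_found.load(std::memory_order_relaxed)) return;
                if (c == target) {
                    winner.store(c);
                    optimum_found.store(true);
                    return;
                }
            }
        });
    }
    for (auto& t : workers) t.join();
    return winner.load();  // -1 if no worker found the target
}
```

How often to poll is the trade-off: checking every iteration stops the other workers sooner but adds a memory read to the inner loop.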
I think Ian has the right idea here. Optimum performance would come from minimal memory transfers and branching. Writing to global memory and checking flags (branching) goes against the CUDA best practices guide and will reduce your speedup.
You might want to look at callbacks. The main CPU thread can make sure all threads run in the right order. CPU callback threads (read: postprocessing) can take on the additional overhead work, call the related API functions, and dispose of all the sub-thread data. This feature is demonstrated in the CUDA samples and compiles for compute capability 2. Hope this helps.

What advantages are there to programming for a non-cache-coherent multi-core machine?

What advantages are there to programming for a non-cache-coherent multi-core machine? Cache coherence has many benefits, but how would one take advantage of the opposite of this feature - an independent cache for each individual core? What programming paradigm and to what particular practical problems would such an architecture be beneficial over a cache-coherent one?
You don't as such take advantage of cache non-coherence. You can't write code which relies on different cores having different views of memory, because a non-coherent cache doesn't guarantee to show different memory to different cores. It just reserves the right to do that.
Cache coherence costs circuits and time. Non-coherent caches are therefore cheaper (and cooler, perhaps?) and faster. Memory access might be faster in cycles, or might be the same best-case speed but with fewer stalls due to cache synchronisation and especially false sharing.
So it's not so much extra things you do to take advantage of non-coherence, it's the things that you don't have to do because you've dropped the disadvantages of coherence - you don't have to redesign your parallel code because it's spending all its time sitting around waiting for the result of a memory store from another core.
The downside of a non-coherent cache architecture at first appears to be that you find yourself adding synchronisation that coherent caches provide automatically. No double-checked locking for you. Then you realise that in effect, the coherent-cache architectures do this synchronisation (albeit in a super-fast hardware-implemented form) for every single memory access, and block if the cache line is dirty, whether you need it to or not. That cheers me right up :-)
What programming paradigm
Message passing.
and to what particular practical problems would such an architecture be beneficial over a cache-coherent one?
Pattern matching - the input block of memory could very well be "read-only": the "output" result can very well be placed in separate blocks waiting for a "reducer" of some sort.
Of course, this is just an example amongst many I am sure.
Just to make things clear: the principal reasons for going with "non-cache-coherent" architecture are cost & speed (assuming the problems at hand are more efficiently tackled using this architecture).
You can get a bit of extra performance, but you should never rely on each processor having different cache values, as you can never know when the cache is flushed.
I'm not an expert, but I don't think it has any advantage over a cache-coherent architecture, besides being simpler to implement. Of course, such simplicity can allow other optimizations that could be prohibitive in a more complex coherent system, making the non-coherent machine faster when carefully programmed.
That said, I concur with jldupont: message passing doesn't need coherency, so it's (almost) the mandatory way to do IPC.
You could think of the Cell SPE local memory as a sort of cache. It isn't cache really since it isn't automatic at all, but the speed is the same and it isn't coherent.
It has big speed advantages because the hardware does not need to spend any time synchronizing the cache line states between cores.
In a Cell, the programmer must do the synchronization manually by writing code to copy SPE local memory back and forth. So a disadvantage is much greater program complexity.
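The manual copy-in, compute, copy-out discipline can be sketched in ordinary C++. This is only an analogy: memcpy stands in for the explicit transfers, and kLocalSize and double_all_via_local_store are hypothetical names, not part of any Cell SDK.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

constexpr std::size_t kLocalSize = 64;  // hypothetical local-store capacity

// Processes main memory one tile at a time: copy in, compute on the private
// buffer, copy out. Nothing is kept in sync automatically.
void double_all_via_local_store(std::vector<int>& main_memory) {
    std::array<int, kLocalSize> local{};
    for (std::size_t off = 0; off < main_memory.size(); off += kLocalSize) {
        const std::size_t n = std::min(kLocalSize, main_memory.size() - off);
        assert(n <= kLocalSize);
        // Explicit copy in: the local buffer is now a private snapshot.
        std::memcpy(local.data(), main_memory.data() + off, n * sizeof(int));
        for (std::size_t i = 0; i < n; ++i) local[i] *= 2;
        // Explicit copy out: results are invisible to others until this runs.
        std::memcpy(main_memory.data() + off, local.data(), n * sizeof(int));
    }
}
```

The extra program complexity shows up even here: the tiling, the two copies and their ordering are all the programmer's responsibility.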

Techniques to Get rid of low level Locking

I'm looking for strategies that can be applied to reduce low-level locking.
The catch is that this is not new code: it is an existing server application with tens of thousands of lines of C++, so I can't just rewrite the whole thing.
I fear there might not be a solution to this problem by now (too late). However I'd like to hear about good patterns others have used.
Right now there are too many locks and not that many conflicts, so it's a paranoia-induced hardware performance issue.
The best way to describe the code is as single threaded code suddenly getting peppered with locks.
Why do you need to eliminate the low-level locking? Do you have deadlock issues? Do you have performance problems? Or scaling issues? Are the locks generally contended or uncontended?
What environment are you using? The answers in C++ will be different to the ones in Java, for example. E.g. uncontended synchronization blocks in Java 6 are actually relatively cheap in performance terms, so simply upgrading your JRE might get you past whatever problem you are trying to solve. There might be similar performance boosts available in C++ by switching to a different compiler or locking library.
In general, there are several strategies that allow you to reduce the number of mutexes you acquire.
First, anything only ever accessed from a single thread doesn't need a mutex.
Second, anything immutable is safe provided it is 'safely published' (i.e. created in such a way that a partially constructed object is never visible to another thread).
Third, most platforms now support atomic writes (typically via compare-and-swap), which can help when a single primitive type (including a pointer) is all that needs protecting. These work very similarly to optimistic locking in a database. You can also use atomic writes to create lock-free algorithms to replace more complex types, including Map implementations. However, unless you are very, very good, you are much better off borrowing somebody else's debugged implementation (the java.util.concurrent package contains lots of good examples) - it is notoriously easy to accidentally introduce bugs when writing your own algorithms.
Fourth, widening the scope of the mutex can help - either simply holding open a mutex for longer, rather than constantly locking and unlocking it, or taking a lock on a 'larger' item - the object rather than one of its properties, for example. However, this has to be done extremely carefully; you can easily introduce problems this way.
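Since the question is about C++, here is what the third strategy can look like with std::atomic: a compare-and-swap retry loop that maintains a shared maximum without a mutex. The names atomic_update_max and max_across_threads are made up for this sketch.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Atomically raises shared_max to candidate if candidate is larger.
void atomic_update_max(std::atomic<int>& shared_max, int candidate) {
    int observed = shared_max.load();
    // Optimistic retry: if another thread changed the value in between,
    // compare_exchange_weak fails and reloads `observed` with the current
    // value, and the comparison is simply tried again.
    while (candidate > observed &&
           !shared_max.compare_exchange_weak(observed, candidate)) {
    }
}

// Thread t offers the values 0..t*10; the result is the overall maximum.
int max_across_threads(int num_threads) {
    assert(num_threads > 0);
    std::atomic<int> shared_max{0};
    std::vector<std::thread> workers;
    for (int t = 1; t <= num_threads; ++t) {
        workers.emplace_back([&shared_max, t] {
            for (int v = 0; v <= t * 10; ++v) atomic_update_max(shared_max, v);
        });
    }
    for (auto& w : workers) w.join();
    return shared_max.load();
}
```

This only works because the protected state is a single int. As soon as two fields must change together, a mutex or a tested lock-free library is usually the safer choice.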
The threading model of your program has to be decided before a single line is written. Any module, if inconsistent with the rest of the program, can crash, corrupt or deadlock the application.
If you have the luxury of starting fresh, try to identify large functions of your program that can be done in parallel and use a thread pool to schedule the tasks. The trick to efficiency is to avoid mutexes wherever possible and (re)code your app to avoid contention for resources at a high level.
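One way to avoid contention at a high level is to partition the work so that threads never touch the same data. A minimal C++ sketch follows, using plain std::thread rather than a real thread pool to keep it short; parallel_sum is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sums `data` with num_threads workers. Each worker owns one slice of the
// input and one output slot, so no mutex is needed anywhere.
long parallel_sum(const std::vector<int>& data, std::size_t num_threads) {
    assert(num_threads > 0);
    std::vector<long> partial(num_threads, 0);  // one slot per thread
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = std::min(t * chunk, data.size());
            const std::size_t end = std::min(begin + chunk, data.size());
            long local = 0;  // accumulate privately, write the slot once
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t] = local;
        });
    }
    for (auto& w : workers) w.join();
    // The only shared step happens after all workers have finished.
    return std::accumulate(partial.begin(), partial.end(), 0L);
}
```

Accumulating into a local variable and writing the slot once also keeps neighbouring slots from bouncing a cache line between cores.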
You may find some of the answers here and here helpful as you look for ways to atomically update shared state without explicit locks.