Difference "conditionally safe" and "thread safe" - language-agnostic

I stumbled upon the thread safety article on Wikipedia; it distinguishes several levels of safety, especially:
Thread safe: Implementation is guaranteed to be free of race conditions when accessed by multiple threads simultaneously.
Conditionally safe: Different threads can access different objects simultaneously, and access to shared data is protected from race conditions.
But to me, both definitions look like different ways to say the same thing. Both guarantee there is no race condition on shared data.
Could someone explain the difference? Thanks.

You should understand that shared data is not the same thing in the two cases.
Thread safe talks about accessing a single instance from multiple threads. So shared data may be any member of that class, if accessed by public methods. It's not shared between instances (because there is only one) but only between threads.
Conditionally safe talks about accessing different instances, each from its own thread. Data must then be shared between instances, so it can only be an aggregated member (perhaps provided by dependency injection), a static member, or an (external) singleton.
But if you read all the citations in the mentioned Wikipedia article (the Qt one is wrong), you will see that Wikipedia may even have misrepresented IBM's naming. "Conditional" in IBM's sense means that only some methods of the class/API are thread-safe, or that thread safety depends on underlying services the API can't influence (the IBM source gives a good example of this). Qt's naming convention of thread-safe vs. reentrant seems more appropriate, as it doesn't distinguish between thread-shared data and instance-shared data.
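To make that distinction concrete, here is a small hypothetical C++ sketch (the class names and members are invented for illustration, not taken from any real API). Counter is "thread safe" in the above sense: many threads call methods on one shared instance, so its own members are the shared data and it locks them itself. Logger is "conditionally safe": each thread uses its own instance, so instance members need no locking, and only the data shared between instances (a static sink) is protected:

#include <mutex>
#include <thread>
#include <vector>

// "Thread safe": many threads share ONE Counter instance, so the instance's
// own members are the shared data and every access is guarded by its mutex.
class Counter {
public:
    void increment() {
        std::lock_guard<std::mutex> guard(mutex_);
        ++value_;
    }
private:
    std::mutex mutex_;
    long value_ = 0;
};

// "Conditionally safe": each thread uses its OWN Logger instance, so the
// per-instance member needs no lock; only the data shared BETWEEN instances
// (here a static sink) has to be protected.
class Logger {
public:
    void log(int event) {
        local_.push_back(event);                      // per-instance, no locking needed
        std::lock_guard<std::mutex> guard(sinkMutex_);
        sink_.push_back(event);                       // shared by all instances
    }
private:
    std::vector<int> local_;
    static std::vector<int> sink_;
    static std::mutex sinkMutex_;
};

std::vector<int> Logger::sink_;
std::mutex Logger::sinkMutex_;

int main() {
    Counter shared;                                   // one instance, used from both threads
    auto work = [&shared] {
        Logger myLogger;                              // one instance per thread
        for (int i = 0; i < 1000; ++i) {
            shared.increment();
            myLogger.log(i);
        }
    };
    std::thread t1(work), t2(work);
    t1.join();
    t2.join();
}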

How come the macro is used as a function, but is not implemented anywhere?

The following code is in MySQL 5.5 storage/example/ha_example.cc:
MYSQL_READ_ROW_START(table_share->db.str, table_share->table_name.str, TRUE);
rc= HA_ERR_END_OF_FILE;
MYSQL_READ_ROW_DONE(rc);
I searched for the MYSQL_READ_ROW_START definition in the whole project and found it in include/probes_mysql_nodtrace.h:
#define MYSQL_READ_ROW_START(arg0, arg1, arg2)
#define MYSQL_READ_ROW_START_ENABLED() (0)
#define MYSQL_READ_ROW_DONE(arg0)
#define MYSQL_READ_ROW_DONE_ENABLED() (0)
These are just empty macro definitions here.
My question is: how come this macro MYSQL_READ_ROW_START is not associated with any function, yet is used like a function in the code above?
Thanks.
These aren't traditional macros: they're probe points for DTrace, an observability framework for Solaris, OS X, FreeBSD and various other operating systems.
DTrace revolves around the notion that different providers offer certain probes at which one can observe running executables or even the operating system itself. Some providers are time-based; by firing at regular intervals the probes can, for example, be used to profile the use of a CPU. Other providers are code-based, and their probes might, for example, fire at the entrance to and exit from functions.
The code you highlight is an example of the USDT (User-land Statically Defined Tracing) provider. The canonical use of the USDT provider is to expose meaningful events within transactions. For example, the beginning and end of a transaction might well occur somewhere deep within different functions; in this case it's best for the developer to identify exactly what he wants to reveal and when.
A USDT probe is more than a switchable printf(), although it can of course be used to reveal information, e.g. some local value such as the intermediate result of a transaction. A USDT probe can also be used to trigger behaviour. For example, one might want to activate some network probes for only the duration of a certain transaction.
Returning to your question, USDT probes are implemented by writing macros in the code that correspond to a description of the provider in a ".d" file elsewhere. This is parsed by the dtrace(1) utility, which generates a header file that is suitable for compilation. On a system that lacks DTrace it would make sense to define a header file in which the USDT macros become no-ops, and judging by the given filename (probes_mysql_nodtrace.h) this is what you are observing.
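As a rough sketch of that pattern (the provider, probe and file names below are made up, not MySQL's actual ones): the probes are described once in a ".d" file, dtrace -h generates the real header on platforms that have DTrace, and everywhere else the macros fall back to empty definitions:

/* my_probes.d -- hypothetical provider description, fed to `dtrace -h -s my_probes.d`:
 *
 *     provider myapp {
 *         probe read__row__start(char *db, char *table, int scan);
 *         probe read__row__done(int rc);
 *     };
 */

/* my_probes.h -- what the application code includes. */
#ifdef HAVE_DTRACE
/* Header generated by dtrace(1); its macros expand to real probe sites
 * that DTrace can enable at run time. */
#include "my_probes_dtrace.h"
#else
/* No DTrace on this platform: every probe "call" compiles away to nothing,
 * exactly like probes_mysql_nodtrace.h in the question. */
#define MYAPP_READ_ROW_START(db, table, scan)
#define MYAPP_READ_ROW_START_ENABLED() (0)
#define MYAPP_READ_ROW_DONE(rc)
#define MYAPP_READ_ROW_DONE_ENABLED() (0)
#endif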
See http://dev.mysql.com/tech-resources/articles/getting_started_dtrace_saha.html.
To quote:
DTrace probes are implemented by kernel modules called providers, each of which performs a particular kind of instrumentation to create probes. Providers can thus be described as publishers of probes that can be consumed by DTrace consumers (see below). Providers can be used for instrumenting kernel and user-level code. For user-level code, there are two ways in which probes can be defined: User-Level Statically Defined Tracing (USDT) or the PID provider.
So it appears to be up to DTrace providers to implement such a macro.

How do I use multi-threading in Tcl?

I'm trying to run two procedures in parallel. Since Tcl is an interpreter, it processes procedures one at a time. Can someone explain, with an example, how I can use multi-threading in Tcl?
These days, the usual way to do multi-threading in Tcl is to use its Thread extension. It is developed alongside the Tcl core, but on certain platforms (such as various Linux-based OSes) you might need to install a separate package to get this extension.
The threading model the Thread extension implements is "one thread per interpreter". This means that each thread can "host" just one Tcl interpreter (plus any number of its child interpreters), but no code executed by any thread may access interpreters hosted in other threads. This, in turn, means that when you work with threads in Tcl, you have to master the idea of multiple interpreters.
The classical approach to exchanging data between interpreters running in different threads is message passing: you post scripts to the input queue of the target interpreter running in a different thread and then wait for a reply. Thread-shared variables (shared memory with locking) are also available, as is support for thread pools.
Read the "Tcl and threads" wiki page, the Thread's extension manual pages.
The code examples are on the wiki. Here's just one of them.
Please note that if the procedures you think have to run in parallel are mostly I/O-bound (that is, they read something from the network and/or send something there) rather than CPU-bound (doing heavy computations), you might get better results with an event-based approach to processing: Tcl has built-in support for the event loop, and you can make it execute your code when the next chunk of data can be read from or written to a channel (such as a network socket).

CUDA contexts, streams, and events on multiple GPUs

TL;DR version: "What's the best way to round-robin kernel calls to multiple GPUs with Python/PyCUDA such that CPU and GPU work can happen in parallel?" with a side of "I can't have been the first person to ask this; anything I should read up on?"
Full version:
I would like to know the best way to design context, etc. handling in an application that uses CUDA on a system with multiple GPUs. I've been trying to find literature that talks about guidelines for when context reuse vs. recreation is appropriate, but so far haven't found anything that outlines best practices, rules of thumb, etc.
The general overview of what we're needing to do is:
Requests come in to a central process.
That process forks to handle a single request.
Data is loaded from the DB (relatively expensive).
The following is repeated an arbitrary number of times based on the request (dozens):
A few quick kernel calls to compute data that is needed for later kernels.
One slow kernel call (10 sec).
Finally:
Results from the kernel calls are collected and processed on the CPU, then stored.
At the moment, each kernel call creates and then destroys a context, which seems wasteful. Setup is taking about 0.1 sec per context and kernel load, and while that's not huge, it is precluding us from moving other quicker tasks to the GPU.
I am trying to figure out the best way to manage contexts, etc. so that we can use the machine efficiently. I think that in the single-GPU case, it's relatively simple:
Create a context before starting any of the GPU work.
Launch the kernels for the first set of data.
Record an event for after the final kernel call in the series.
Prepare the second set of data on the CPU while the first is computing on the GPU.
Launch the second set, repeat.
Ensure that each event gets synchronized before collecting the results and storing them.
That seems like it should do the trick, assuming proper use of overlapped memory copies.
However, I'm unsure what I should do when wanting to round-robin each of the dozens of items to process over multiple GPUs.
The host program is Python 2.7, using PyCUDA to access the GPU. Currently it's not multi-threaded, and while I'd rather keep it that way ("now you have two problems" etc.), if the answer means threads, it means threads. Similarly, it would be nice to just be able to call event.synchronize() in the main thread when it's time to block on data, but for our needs efficient use of the hardware is more important. Since we'll potentially be servicing multiple requests at a time, letting other processes use the GPU when this process isn't using it is important.
I don't think that we have any explicit reason to use Exclusive compute modes (i.e., we're not filling up the memory of the card with one work item), so I don't think that solutions that involve long-standing contexts are off the table.
Note that answers in the form of links to other content that covers my questions are completely acceptable (encouraged, even), provided they go into enough detail about the why, not just the API. Thanks for reading!
Caveat: I'm not a PyCUDA user (yet).
With CUDA 4.0+ you don't even need an explicit context per GPU. You can just call cudaSetDevice (or the PyCUDA equivalent) before doing per-device stuff (cudaMalloc, cudaMemcpy, launch kernels, etc.).
If you need to synchronize between GPUs, you may need to create streams and/or events and use cudaEventSynchronize (or the PyCUDA equivalent). You can even have one stream wait on an event recorded in another stream to build sophisticated dependencies.
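For illustration, here is a minimal host-side sketch of that round-robin idea using the CUDA runtime API from C++ (not PyCUDA; the buffer size, the item count and the cudaMemsetAsync stand-in for the real kernel are all placeholders): one implicitly managed context per device via cudaSetDevice, plus one stream and one "done" event per device, so the CPU can prepare the next item while the GPUs work:

#include <cuda_runtime.h>
#include <vector>

int main() {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev == 0) return 1;

    std::vector<cudaStream_t> stream(ndev);
    std::vector<cudaEvent_t>  done(ndev);
    std::vector<float*>       buf(ndev);

    for (int d = 0; d < ndev; ++d) {
        cudaSetDevice(d);                          // selects (and lazily creates) this device's context
        cudaStreamCreate(&stream[d]);
        cudaEventCreate(&done[d]);
        cudaMalloc(reinterpret_cast<void**>(&buf[d]), 1 << 20);
    }

    const int items = 24;                          // the "dozens" of work items per request
    for (int i = 0; i < items; ++i) {
        int d = i % ndev;                          // round-robin over the GPUs
        cudaSetDevice(d);
        cudaEventSynchronize(done[d]);             // wait until this GPU finished its previous item
        // Stand-in for the real work: the input copy and the slow kernel would be
        // issued here on stream[d], e.g. cudaMemcpyAsync(...) followed by a launch.
        cudaMemsetAsync(buf[d], 0, 1 << 20, stream[d]);
        cudaEventRecord(done[d], stream[d]);       // CPU is now free to prepare item i+1
    }

    for (int d = 0; d < ndev; ++d) {               // drain all GPUs before collecting results
        cudaSetDevice(d);
        cudaEventSynchronize(done[d]);
    }
    return 0;
}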
So I suspect the answer today is quite a lot simpler than talonmies' excellent pre-CUDA-4.0 answer.
You might also find this answer useful.
(Re)Edit by OP: Per my understanding, PyCUDA supports versions of CUDA prior to 4.0, and so still uses the old API/semantics (the driver API?), so talonmies' answer is still relevant.

Performance evaluation with message passing

I have to build a distributed application, using MPI.
One of the decisions I have to make is how to map instances of classes onto processes (and then onto machines), in order to take maximum advantage of the distributed environment.
My question is: is there a model that lets me choose the better mapping? I mean, some arrangements are surely wrong (for example, putting on two different machines two objects that have to process a fairly large amount of data together, sequentially, without a stream of tokens to process), but is there a systematic way to detect such wrong arrangements, based on the flow of execution, the message complexity, and the time taken by the computation done by the algorithmic components?
Well, there are data flow diagrams. Those can help identify opportunities for parallelism and its pitfalls. The references on the Wikipedia page might give you some more theoretical grounding.
When I worked at Lockheed Martin, I was exposed to CSIM, a tool they developed for modeling algorithm mapping to processing blocks.
Another thing you might try is the Join Calculus. I've found examples of programming with it to be surprisingly intuitive, and I think it's well grounded in theory. I'm not sure why it hasn't caught on more.
The other approach is the Pi Calculus, and I think that might be more popular, though it seems harder to understand.
A practical solution to this would be to use a different model of distributed-memory parallel programming, one that directly addresses your concerns. I work on the Charm++ programming system, whose model is that of individual objects sending messages to one another. The runtime system facilitates automatic mapping of these objects to available processors, to account for issues of load balance and communication locality.

What are some tricks that a processor does to optimize code?

I am looking for things like reordering of code that could even break the code in the case of multiple processors.
The most important one would be memory access reordering.
Absent memory fences or serializing instructions, the processor is free to reorder memory accesses. Some processor architectures have restrictions on how much they can reorder; Alpha is known for being the weakest (i.e., the one which can reorder the most).
A very good treatment of the subject can be found in the Linux kernel source documentation, at Documentation/memory-barriers.txt.
Most of the time, it's best to use locking primitives from your compiler or standard library; these are well tested, should have all the necessary memory barriers in place, and are probably quite optimized (optimizing locking primitives is tricky; even the experts can get them wrong sometimes).
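To make "memory access reordering" concrete, here is a minimal C++11 sketch of the classic store-buffer test (the variable names are made up): each thread stores to one flag and then loads the other. With relaxed atomics the processor may effectively reorder the store and the load, so both threads can read 0; with the default std::memory_order_seq_cst, or with a lock around both operations, that outcome is forbidden:

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1 = -1, r2 = -1;

int main() {
    // Each thread stores to one flag and then loads the other.  With relaxed
    // ordering the store may be reordered after the load (even on x86, via the
    // store buffer), so r1 == 0 && r2 == 0 is a legal outcome.  Using the
    // default std::memory_order_seq_cst on both operations rules that out.
    std::thread t1([] {
        x.store(1, std::memory_order_relaxed);
        r1 = y.load(std::memory_order_relaxed);
    });
    std::thread t2([] {
        y.store(1, std::memory_order_relaxed);
        r2 = x.load(std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    std::printf("r1=%d r2=%d\n", r1, r2);   // "r1=0 r2=0" is possible here
    return 0;
}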
Wikipedia has a fairly comprehensive list of optimization techniques.
Yes, but what exactly is your question?
However, since this is an interesting topic: the tricks that compilers and processors use to optimize code should not break that code, even with multiple processors, as long as the code is free of race conditions. This is the guarantee of sequential consistency: if your program has no race conditions and all shared data is correctly locked before being accessed, the code behaves as if it were executed in some sequential interleaving of the threads.
There is a really good video of Herb Sutter talking about this here:
http://video.google.com/videoplay?docid=-4714369049736584770
Everyone should watch this :)
DavidK's answer is correct; however, it is also very important to be aware of the memory model of your language/runtime. Even without race conditions, and with sequential consistency and mutex usage, your code can still break when data is cached by different threads running on different cores of the CPU. Some languages (Java is one example) guarantee the state of data between threads when a mutex lock is used, but it is rarely enough to simply ensure that no two threads access the data at the same time: you need to use the mutex in a way that makes the language runtime synchronize the data state between the two threads. In Java this is done by having the two threads synchronize on the same object.
Here is a good page explaining the problem and how it's dealt with in Java's memory model.
http://gee.cs.oswego.edu/dl/cpj/jmm.html
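The same point can be illustrated outside Java; here is a minimal C++ sketch of "both threads must use the same lock", with std::mutex playing the role of the shared monitor object (the names are invented):

#include <iostream>
#include <mutex>
#include <thread>

std::mutex sharedLock;     // the ONE lock both threads agree to use
int sharedValue = 0;       // plain data, protected only by sharedLock

int main() {
    std::thread writer([] {
        std::lock_guard<std::mutex> guard(sharedLock);
        sharedValue = 42;              // the unlock publishes this write...
    });
    std::thread reader([] {
        std::lock_guard<std::mutex> guard(sharedLock);
        // ...and this lock makes it visible here.  Locking a *different*
        // mutex (or none at all) would give no such visibility guarantee.
        std::cout << "sharedValue = " << sharedValue << '\n';   // prints 0 or 42
    });
    writer.join();
    reader.join();
    return 0;
}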