How to implement handles for a CUDA driver API library?

How to implement handles for a CUDA driver API library? - cuda

Note: The question has been updated to address the questions that have been raised in the comments, and to emphasize that the core of the question is about the interdependencies between the Runtime- and Driver API
The CUDA runtime libraries (like CUBLAS or CUFFT) are generally using the concept of a "handle" that summarizes the state and context of such a library. The usage pattern is quite simple:
// Create a handle
cublasHandle_t handle;
cublasCreate(&handle);
// Call some functions, always passing in the handle as the first argument
cublasSscal(handle, ...);
// When done, destroy the handle
cublasDestroy(handle);
However, there are many subtle details about how these handles interoperate with Driver- and Runtime contexts and multiple threads and devices. The documentation lists several, scattered details about context handling:
The general description of contexts in the CUDA Programming Guide at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#context
The handling of multiple contexts, as described in the CUDA Best Practices Guide at http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#multiple-contexts
The context management differences between runtime and driver API, explained at http://docs.nvidia.com/cuda/cuda-driver-api/driver-vs-runtime-api.html
The general description of CUBLAS contexts/handles at http://docs.nvidia.com/cuda/cublas/index.html#cublas-context and their thread safety at http://docs.nvidia.com/cuda/cublas/index.html#thread-safety2
However, some of information seems to be not entirely up to date (for example, I think one should use cuCtxSetCurrent instead of cuCtxPushCurrent and cuCtxPopCurrent?), some of it seems to be from a time before the "Primary Context" handling was exposed via the driver API, and some parts are oversimplified in that they only show the most simple usage patterns, make only vague or incomplete statements about multithreading, or cannot be applied to the concept of "handles" that is used in the runtime libraries.
My goal is to implement a runtime library that offers its own "handle" type, and that allows usage patterns that are equivalent to the other runtime libraries in terms of context handling and thread safety.
For the case that the library can internally be implemented solely using the Runtime API, things may be clear: The context management is solely in the responsibility of the user. If he creates an own driver context, the rules that are stated in the documentation about the Runtime- and Driver context management will apply. Otherwise, the Runtime API functions will take care of the handling of primary contexts.
However, there may be the case that a library will internally have to use the Driver API. For example, in order to load PTX files as CUmodule objects, and obtain the CUfunction objects from them. And when the library should - for the user - behave like a Runtime library, but internally has to use the Driver API, some questions arise about how the context handling has to be implemented "under the hood".
What I have figured out so far is sketched here.
(It is "pseudocode" in that it omits the error checks and other details, and ... all this is supposed to be implemented in Java, but that should not be relevant here)
1. The "Handle" is basically a class/struct containing the following information:
class Handle
{
CUcontext context;
boolean usingPrimaryContext;
CUdevice device;
}
2. When it is created, two cases have to be covered: It can be created when a driver context is current for the calling thread. In this case, it should use this context. Otherwise, it should use the primary context of the current (runtime) device:
Handle createHandle()
{
cuInit(0);
// Obtain the current context
CUcontext context;
cuCtxGetCurrent(&context);
CUdevice device;
// If there is no context, use the primary context
boolean usingPrimaryContext = false;
if (context == nullptr)
{
usingPrimaryContext = true;
// Obtain the device that is currently selected via the runtime API
int deviceIndex;
cudaGetDevice(&deviceIndex);
// Obtain the device and its primary context
cuDeviceGet(&device, deviceIndex);
cuDevicePrimaryCtxRetain(&context, device));
cuCtxSetCurrent(context);
}
else
{
cuCtxGetDevice(device);
}
// Create the actual handle. This might internally allocate
// memory or do other things that are specific for the context
// for which the handle is created
Handle handle = new Handle(device, context, usingPrimaryContext);
return handle;
}
3. When invoking a kernel of the library, the context of the associated handle is made current for the calling thread:
void someLibraryFunction(Handle handle)
{
cuCtxSetCurrent(handle.context);
callMyKernel(...);
}
Here, one could argue that the caller is responsible for making sure that the required context is current. But if the handle was created for a primary context, then this context will be made current automatically.
4. When the handle is destroyed, this means that cuDevicePrimaryCtxRelease has to be called, but only when the context is a primary context:
void destroyHandle(Handle handle)
{
if (handle.usingPrimaryContext)
{
cuDevicePrimaryCtxRelease(handle.device);
}
}
From my experiments so far, this seems to expose the same behavior as a CUBLAS handle, for example. But my possibilities for thoroughly testing this are limited, because I only have a single device, and thus cannot test the crucial cases, e.g. of having two contexts, one for each of two devices.
So my questions are:
Are there any established patterns for implementing such a "Handle"?
Are there any usage patterns (e.g. with multiple devices and one context per device) that could not be covered with the approach that is sketched above, but would be covered with the "handle" implementations of CUBLAS?
More generally: Are there any recommendations of how to improve the current "Handle" implementation?
Rhetorical: Is the source code of the CUBLAS handle handling available somewhere?
(I also had a look at the context handling in tensorflow, but I'm not sure whether one can derive recommendations about how to implement handles for a runtime library from that...)
(An "Update" has been removed here, because it was added in response to the comments, and should no longer be relevant)

I'm sorry I hadn't noticed this question sooner - as we might have collaborated on this somewhat. Also, it's not quite clear to me whether this question belongs here, on codereview.SX or on programmers.SX, but let's ignore all that.
I have now done what you were aiming to do, and possibly more generally. So, I can offer both an example of what to do with "handles", and moreover, suggest the prospect of not having to implement this at all.
The library is an expanding of cuda-api-wrappers to also cover the Driver API and NVRTC; it is not yet release-grade, but it is in the testing phase, on this branch.
Now, to answer your concrete question:
Pattern for writing a class surrounding a raw "handle"
Are there any established patterns for implementing such a "Handle"?
Yes. If you read:
What is the difference between: Handle, Pointer and Reference
you'll notice a handle is defined as an "opaque reference to an object". It has some similarity to a pointer. A relevant pattern, therefore, is a variation on the PIMPL idiom: In regular PIMPL, you write an implementation class, and the outwards-facing class only holds a pointer to the implementation class and forwards method calls to it. When you have an opaque handle to an opaque object in some third-party library or driver - you use the handle to forward method calls to that implementation.
That means, that your outwards-facing class is not a handle, it represents the object to which you have a handle.
Generality and flexibility
Are there any usage patterns (e.g. with multiple devices and one context per device) that could not be covered with the approach that is sketched above, but would be covered with the "handle" implementations of CUBLAS?
I'm not sure what exactly CUBLAS does under the hood (and I have almost never used CUBLAS to be honest), but if it were well-designed and implemented, it would
create its own context, and try to not to impinge on the rest of your code, i.e. it would alwas do:
Push our CUBLAS context onto the top of the stack
Do actual work
Pop the top of the context stack.
Your class doesn't do this.
More generally: Are there any recommendations of how to improve the current "Handle" implementation?
Yes:
Use RAII whenever it is possible and relevant. If your creation code allocates a resource (e.g. via the CUDA driver) - the destructor for the object you return should safely release those resources.
Allow for both reference-type and value-type use of Handles, i.e. it may be the handle I created, but it may also be a handle I got from somewhere else and isn't my responsibility. This is trivial if you leave it up to the user to release resources, but a bit tricky if you take that responsibility
You assume that if there's any current context, that's the one your handle needs to use. Says who? At the very least, let the user pass a context in if they want to.
Avoid writing the low-level parts of this on your own unless you really must. You are quite likely to miss some things (the push-and-pop is not the only thing you might be missing), and you're repeating a lot of work that is actually generic and not specific to your application or library. I may be biased here, but you can now use nice, RAII-ish, wrappers for CUDA contexts, streams, modules, devices etc. without even known about raw handles for anything.
Rhetorical: Is the source code of the CUBLAS handle handling available somewhere?
To the best of my knowledge, NVIDIA hasn't released it.

Related

cuvidGetDecoderCaps not work with primary context [duplicate]

How can I create a CUDA context?
The first call of CUDA is slow and I want to create the context before I launch my kernel.

The canonical way to force runtime API context establishment is to call cudaFree(0). If you have multiple devices, call cudaSetDevice() with the ID of the device you want to establish a context on, then cudaFree(0) to establish the context.
EDIT: Note that as of CUDA 5.0, it appears that the heuristics of context establishment are slightly different and cudaSetDevice() itself establishes context on the device is it called on. So the explicit cudaFree(0) call is no longer necessary (although it won't hurt anything).

Using the runtime API: cudaDeviceSynchronize, cudaDeviceGetLimit, or anything that actually accesses the context should work.
I'm quite certain you're not using the driver API, as it doesn't do that sort of lazy initialization, but for others' benefit the driver call would be cuCtxCreate.

What is a CUDA context?

Can anyone please explain or refer me some good source about what is a CUDA context? I searched CUDA developer guide and I was not satisfied with it.
Any explanation or help will be great.

The cuda API exposes features of a stateful library: two consecutive calls relate one-another. In short, the context is its state.
The runtime API is a wrapper/helper of the driver API. You can see in the driver API that the context is explicitly made available, and you can have a stack of contexts for convenience. There is one specific context which is shared between driver and runtime API (See primary context)).
The context holds all the management data to control and use the device. For instance, it holds the list of allocated memory, the loaded modules that contain device code, the mapping between CPU and GPU memory for zero copy, etc.
Finally, note that this post is more from experience than documentation-proofed.

essentially, a data structure that holds information relevant to mantaining a consistent state between the calls that you make, e.g. (open) (execute) (close)
This is so that the functions that you invoke can send the signals in the right direction even if you don't specifically tell them what that direction is.

To use mem and then find or to use find and handle the exception?

Suppose, I want to write a function that tries to find a key in a map and returns None if it cannot: try_find: 'a -> ('a, 'b) Map.t -> 'b option, what is the canonical way to do this? To first check that the key exists with mem and then call find? Or to catch the Not_found exception? Batteries seem to do the latter.
On the other hand, in languages like C# or Java people are usually discouraged from using exceptions in such cases, for performance reasons. Is using exceptions on "normal" execution paths a usual thing in Ocaml or is it also discouraged?

OCaml exceptions are as fast as function calls for the default backend. For Javascript backends, it is not always true. The canonical OCaml way is to implement a function that doesn't throw an exception is to use a throwing function and translate the exception to a nullary variant, e.g.,
let try_find x xs = try Some (List.find x xs) with Not_found -> None
Calling mem and find is a loss of performance, as you will actually iterate the list twice.
There are tradeoffs between raising an exception and returning an option type. The standard function List.find will not allocate any new values in the heap, so no garbage will be created. On the other hand, the try_find function will allocate a new value every time something is found (None is a constant so it is not allocated). This will create an extra work for the garbage collector, that will eventually degrade the performance. To me, the semantic benefits of total functions outweigh possible performance degradation. If the latter does matter (in case of tight loops) then I can always optimize it locally by either using an exception in a very tight context, or continuation passing style and/or GADT.
Is using exceptions on "normal" execution paths a usual thing in Ocaml or is it also discouraged?
It wasn't discouraged by the design of the language, and OCaml standard library uses exceptions a lot. However, the language evolves, and new features are added to the language. Moreover, new backends are implemented, like several Javascript backends, Java, and .Net backends. It is not trivial, to provide the same performance guarantees for these backends. So with a time, the popularity of exceptions reduced, and many people started to favor total functions with explicitly encoded errors, cf., the newly added to the standard library result type. Another example is Janestreet Core library (and all other libraries) that disfavor exceptions and use them only for exceptional cases.
You should decide by yourself an exception policy (or borrow the existing one). My personal policy is trying to avoid them in the public interfaces and sparingly use them very locally. I also use exceptions, for logic and programmer errors, basically, for errors, that shouldn't be captured.

From what I've seen, OCaml exceptions are quite efficient, and I see them being used more often than in other functional languages I guess.
I try to avoid them myself as they interfere with reasoning about the program. But a self-contained use in a library doesn't seem so bad.
The efficiency of low-level things like exceptions is something that might vary a lot from platform to platform. I suspect that catching the Not_found exception would be faster for very large maps, as it avoids traversing the map twice. Otherwise it might not matter much.

How to create a CUDA context?

How can I create a CUDA context?
The first call of CUDA is slow and I want to create the context before I launch my kernel.

The canonical way to force runtime API context establishment is to call cudaFree(0). If you have multiple devices, call cudaSetDevice() with the ID of the device you want to establish a context on, then cudaFree(0) to establish the context.
EDIT: Note that as of CUDA 5.0, it appears that the heuristics of context establishment are slightly different and cudaSetDevice() itself establishes context on the device is it called on. So the explicit cudaFree(0) call is no longer necessary (although it won't hurt anything).

Using the runtime API: cudaDeviceSynchronize, cudaDeviceGetLimit, or anything that actually accesses the context should work.
I'm quite certain you're not using the driver API, as it doesn't do that sort of lazy initialization, but for others' benefit the driver call would be cuCtxCreate.

What is the difference between message-passing and method-invocation?

Is there a difference between message-passing and method-invocation, or can they be considered equivalent? This is probably specific to the language; many languages don't support message-passing (though all the ones I can think of support methods) and the ones that do can have entirely different implementations. Also, there are big differences in method-invocation depending on the language (C vs. Java vs Lisp vs your favorite language). I believe this is language-agnostic. What can you do with a passed-method that you can't do with an invoked-method, and vice-versa (in your favorite language)?

Using Objective-C as an example of messages and Java for methods, the major difference is that when you pass messages, the Object decides how it wants to handle that message (usually results in an instance method in the Object being called).
In Java however, method invocation is a more static thing, because you must have a reference to an Object of the type you are calling the method on, and a method with the same name and type signature must exist in that type, or the compiler will complain. What is interesting is the actual call is dynamic, although this is not obvious to the programmer.
For example, consider a class such as
class MyClass {
void doSomething() {}
}
class AnotherClass {
void someMethod() {
Object object = new Object();
object.doSomething(); // compiler checks and complains that Object contains no such method.
// However, through an explicit cast, you can calm the compiler down,
// even though your program will crash at runtime
((MyClass) object).doSomething(); // syntactically valid, yet incorrect
}
}
In Objective-C however, the compiler simply issues you a warning for passing a message to an Object that it thinks the Object may not understand, but ignoring it doesn't stop your program from executing.
While this is very powerful and flexible, it can result in hard-to-find bugs when used incorrectly because of stack corruption.
Adapted from the article here.
Also see this article for more information.

as a first approximation, the answer is: none, as long as you "behave normally"
Even though many people think there is - technically, it is usually the same: a cached lookup of a piece of code to be executed for a particular named-operation (at least for the normal case). Calling the name of the operation a "message" or a "virtual-method" does not make a difference.
BUT: the Actor language is really different: in having active objects (every object has an implicit message-queue and a worker thread - at least conceptionally), parallel processing becones easier to handle (google also "communicating sequential processes" for more).
BUT: in Smalltalk, it is possible to wrap objects to make them actor-like, without actually changing the compiler, the syntax or even recompiling.
BUT: in Smalltalk, when you try to send a message which is not understoof by the receiver (i.e. "someObject foo:arg"), a message-object is created, containing the name and the arguments, and that message-object is passed as argument to the "doesNotUnderstand" message. Thus, an object can decide itself how to deal with unimplemented message-sends (aka calls of an unimplemented method). It can - of course - push them into a queue for a worker process to sequentialize them...
Of course, this is impossible with statically typed languages (unless you make very heavy use of reflection), but is actually a VERY useful feature. Proxy objects, code load on demand, remote procedure calls, learning and self-modifying code, adapting and self-optimizing programs, corba and dcom wrappers, worker queues are all built upon that scheme. It can be misused, and lead to runtime bugs - of course.
So it it is a two-sided sword. Sharp and powerful, but dangerous in the hand of beginners...
EDIT: I am writing about language implementations here (as in Java vs. Smalltalk - not inter-process mechanisms.

IIRC, they've been formally proven to be equivalent. It doesn't take a whole lot of thinking to at least indicate that they should be. About all it takes is ignoring, for a moment, the direct equivalence of the called address with an actual spot in memory, and consider it simply as a number. From this viewpoint, the number is simply an abstract identifier that uniquely identifies a particular type of functionality you wish to invoke.
Even when you are invoking functions in the same machine, there's no real requirement that the called address directly specify the physical (or even virtual) address of the called function. For example, although almost nobody ever really uses them, Intel protected mode task gates allow a call to be made directly to the task gate itself. In this case, only the segment part of the address is treated as an actual address -- i.e., any call to a task gate segment ends up invoking the same address, regardless of the specified offset. If so desired, the processing code can examine the specified offset, and use it to decide upon an individual method to be invoked -- but the relationship between the specified offset and the address of the invoked function can be entirely arbitrary.
A member function call is simply a type of message passing that provides (or at least facilitates) an optimization under the common circumstance that the client and server of the service in question share a common address space. The 1:1 correspondence between the abstract service identifier and the address at which the provider of that service reside allows a trivial, exceptionally fast, mapping from one to the other.
At the same time, make no mistake about it: the fact that something looks like a member function call doesn't prevent it from actually executing on another machine or asynchronously, or (frequently) both. The typical mechanism to accomplish this is proxy function that translates the "virtual message" of a member function call into a "real message" that can (for example) be transmitted over a network as needed (e.g., Microsoft's DCOM, and CORBA both do this quite routinely).

They really aren't the same thing in practice. Message passing is a way to transfer data and instructions between two or more parallel processes. Method invocation is a way to call a subroutine. Erlang's concurrency is built on the former concept with its Concurrent Oriented Programing.
Message passing most likely involves a form of method invocation, but method invocation doesn't necessarily involve message passing. If it did it would be message passing. Message passing is one form of performing synchronization between to parallel processes. Method invocation generally means synchronous activities. The caller waits for the method to finish before it can continue. Message passing is a form of a coroutine. Method-invocation is a form of subroutine.
All subroutines are coroutines, but all coroutines are not subroutines.

Is there a difference between message-passing and method-invocation, or can they be considered equivalent?
They're similar. Some differences:
Messages can be passed synchronously or asynchronously (e.g. the difference between SendMessage and PostMessage in Windows)
You might send a message without knowing exactly which remote object you're sending it to
The target object might be on a remote machine or O/S.

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008