cuvidGetDecoderCaps not work with primary context [duplicate] - cuda

How can I create a CUDA context?
The first call of CUDA is slow and I want to create the context before I launch my kernel.

The canonical way to force runtime API context establishment is to call cudaFree(0). If you have multiple devices, call cudaSetDevice() with the ID of the device you want to establish a context on, then cudaFree(0) to establish the context.
EDIT: Note that as of CUDA 5.0, it appears that the heuristics of context establishment are slightly different and cudaSetDevice() itself establishes context on the device is it called on. So the explicit cudaFree(0) call is no longer necessary (although it won't hurt anything).

Using the runtime API: cudaDeviceSynchronize, cudaDeviceGetLimit, or anything that actually accesses the context should work.
I'm quite certain you're not using the driver API, as it doesn't do that sort of lazy initialization, but for others' benefit the driver call would be cuCtxCreate.

Related

Until when exactly can I set a CUDA device' (primary context) scheduling policy?

I want to set the scheduling policy for (the primary context of a) CUDA device.
Reading the Runtime API guide, I notice that:
If the current device has been set and that device has already been initialized then this call will fail with the error cudaErrorSetOnActiveProcess. In this case it is necessary to reset device using cudaDeviceReset() before the device's initialization flags may be set.
Setting the current device - I understand. But what exactly does it mean for the device to have been "initialized"? How do I avoid it being "initialized"?
Note: this question is related.
It seems this means the initialiation of the device's primary context. This, for example
cuDevicePrimaryCtxRetain(&pc, device_id);
should make it so that you can no longer set the policy. This makes sense, since for other contexts you also set such policy once, on context creation.

How to implement handles for a CUDA driver API library?

Note: The question has been updated to address the questions that have been raised in the comments, and to emphasize that the core of the question is about the interdependencies between the Runtime- and Driver API
The CUDA runtime libraries (like CUBLAS or CUFFT) are generally using the concept of a "handle" that summarizes the state and context of such a library. The usage pattern is quite simple:
// Create a handle
cublasHandle_t handle;
cublasCreate(&handle);
// Call some functions, always passing in the handle as the first argument
cublasSscal(handle, ...);
// When done, destroy the handle
cublasDestroy(handle);
However, there are many subtle details about how these handles interoperate with Driver- and Runtime contexts and multiple threads and devices. The documentation lists several, scattered details about context handling:
The general description of contexts in the CUDA Programming Guide at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#context
The handling of multiple contexts, as described in the CUDA Best Practices Guide at http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#multiple-contexts
The context management differences between runtime and driver API, explained at http://docs.nvidia.com/cuda/cuda-driver-api/driver-vs-runtime-api.html
The general description of CUBLAS contexts/handles at http://docs.nvidia.com/cuda/cublas/index.html#cublas-context and their thread safety at http://docs.nvidia.com/cuda/cublas/index.html#thread-safety2
However, some of information seems to be not entirely up to date (for example, I think one should use cuCtxSetCurrent instead of cuCtxPushCurrent and cuCtxPopCurrent?), some of it seems to be from a time before the "Primary Context" handling was exposed via the driver API, and some parts are oversimplified in that they only show the most simple usage patterns, make only vague or incomplete statements about multithreading, or cannot be applied to the concept of "handles" that is used in the runtime libraries.
My goal is to implement a runtime library that offers its own "handle" type, and that allows usage patterns that are equivalent to the other runtime libraries in terms of context handling and thread safety.
For the case that the library can internally be implemented solely using the Runtime API, things may be clear: The context management is solely in the responsibility of the user. If he creates an own driver context, the rules that are stated in the documentation about the Runtime- and Driver context management will apply. Otherwise, the Runtime API functions will take care of the handling of primary contexts.
However, there may be the case that a library will internally have to use the Driver API. For example, in order to load PTX files as CUmodule objects, and obtain the CUfunction objects from them. And when the library should - for the user - behave like a Runtime library, but internally has to use the Driver API, some questions arise about how the context handling has to be implemented "under the hood".
What I have figured out so far is sketched here.
(It is "pseudocode" in that it omits the error checks and other details, and ... all this is supposed to be implemented in Java, but that should not be relevant here)
1. The "Handle" is basically a class/struct containing the following information:
class Handle
{
CUcontext context;
boolean usingPrimaryContext;
CUdevice device;
}
2. When it is created, two cases have to be covered: It can be created when a driver context is current for the calling thread. In this case, it should use this context. Otherwise, it should use the primary context of the current (runtime) device:
Handle createHandle()
{
cuInit(0);
// Obtain the current context
CUcontext context;
cuCtxGetCurrent(&context);
CUdevice device;
// If there is no context, use the primary context
boolean usingPrimaryContext = false;
if (context == nullptr)
{
usingPrimaryContext = true;
// Obtain the device that is currently selected via the runtime API
int deviceIndex;
cudaGetDevice(&deviceIndex);
// Obtain the device and its primary context
cuDeviceGet(&device, deviceIndex);
cuDevicePrimaryCtxRetain(&context, device));
cuCtxSetCurrent(context);
}
else
{
cuCtxGetDevice(device);
}
// Create the actual handle. This might internally allocate
// memory or do other things that are specific for the context
// for which the handle is created
Handle handle = new Handle(device, context, usingPrimaryContext);
return handle;
}
3. When invoking a kernel of the library, the context of the associated handle is made current for the calling thread:
void someLibraryFunction(Handle handle)
{
cuCtxSetCurrent(handle.context);
callMyKernel(...);
}
Here, one could argue that the caller is responsible for making sure that the required context is current. But if the handle was created for a primary context, then this context will be made current automatically.
4. When the handle is destroyed, this means that cuDevicePrimaryCtxRelease has to be called, but only when the context is a primary context:
void destroyHandle(Handle handle)
{
if (handle.usingPrimaryContext)
{
cuDevicePrimaryCtxRelease(handle.device);
}
}
From my experiments so far, this seems to expose the same behavior as a CUBLAS handle, for example. But my possibilities for thoroughly testing this are limited, because I only have a single device, and thus cannot test the crucial cases, e.g. of having two contexts, one for each of two devices.
So my questions are:
Are there any established patterns for implementing such a "Handle"?
Are there any usage patterns (e.g. with multiple devices and one context per device) that could not be covered with the approach that is sketched above, but would be covered with the "handle" implementations of CUBLAS?
More generally: Are there any recommendations of how to improve the current "Handle" implementation?
Rhetorical: Is the source code of the CUBLAS handle handling available somewhere?
(I also had a look at the context handling in tensorflow, but I'm not sure whether one can derive recommendations about how to implement handles for a runtime library from that...)
(An "Update" has been removed here, because it was added in response to the comments, and should no longer be relevant)
I'm sorry I hadn't noticed this question sooner - as we might have collaborated on this somewhat. Also, it's not quite clear to me whether this question belongs here, on codereview.SX or on programmers.SX, but let's ignore all that.
I have now done what you were aiming to do, and possibly more generally. So, I can offer both an example of what to do with "handles", and moreover, suggest the prospect of not having to implement this at all.
The library is an expanding of cuda-api-wrappers to also cover the Driver API and NVRTC; it is not yet release-grade, but it is in the testing phase, on this branch.
Now, to answer your concrete question:
Pattern for writing a class surrounding a raw "handle"
Are there any established patterns for implementing such a "Handle"?
Yes. If you read:
What is the difference between: Handle, Pointer and Reference
you'll notice a handle is defined as an "opaque reference to an object". It has some similarity to a pointer. A relevant pattern, therefore, is a variation on the PIMPL idiom: In regular PIMPL, you write an implementation class, and the outwards-facing class only holds a pointer to the implementation class and forwards method calls to it. When you have an opaque handle to an opaque object in some third-party library or driver - you use the handle to forward method calls to that implementation.
That means, that your outwards-facing class is not a handle, it represents the object to which you have a handle.
Generality and flexibility
Are there any usage patterns (e.g. with multiple devices and one context per device) that could not be covered with the approach that is sketched above, but would be covered with the "handle" implementations of CUBLAS?
I'm not sure what exactly CUBLAS does under the hood (and I have almost never used CUBLAS to be honest), but if it were well-designed and implemented, it would
create its own context, and try to not to impinge on the rest of your code, i.e. it would alwas do:
Push our CUBLAS context onto the top of the stack
Do actual work
Pop the top of the context stack.
Your class doesn't do this.
More generally: Are there any recommendations of how to improve the current "Handle" implementation?
Yes:
Use RAII whenever it is possible and relevant. If your creation code allocates a resource (e.g. via the CUDA driver) - the destructor for the object you return should safely release those resources.
Allow for both reference-type and value-type use of Handles, i.e. it may be the handle I created, but it may also be a handle I got from somewhere else and isn't my responsibility. This is trivial if you leave it up to the user to release resources, but a bit tricky if you take that responsibility
You assume that if there's any current context, that's the one your handle needs to use. Says who? At the very least, let the user pass a context in if they want to.
Avoid writing the low-level parts of this on your own unless you really must. You are quite likely to miss some things (the push-and-pop is not the only thing you might be missing), and you're repeating a lot of work that is actually generic and not specific to your application or library. I may be biased here, but you can now use nice, RAII-ish, wrappers for CUDA contexts, streams, modules, devices etc. without even known about raw handles for anything.
Rhetorical: Is the source code of the CUBLAS handle handling available somewhere?
To the best of my knowledge, NVIDIA hasn't released it.

What is a CUDA context?

Can anyone please explain or refer me some good source about what is a CUDA context? I searched CUDA developer guide and I was not satisfied with it.
Any explanation or help will be great.
The cuda API exposes features of a stateful library: two consecutive calls relate one-another. In short, the context is its state.
The runtime API is a wrapper/helper of the driver API. You can see in the driver API that the context is explicitly made available, and you can have a stack of contexts for convenience. There is one specific context which is shared between driver and runtime API (See primary context)).
The context holds all the management data to control and use the device. For instance, it holds the list of allocated memory, the loaded modules that contain device code, the mapping between CPU and GPU memory for zero copy, etc.
Finally, note that this post is more from experience than documentation-proofed.
essentially, a data structure that holds information relevant to mantaining a consistent state between the calls that you make, e.g. (open) (execute) (close)
This is so that the functions that you invoke can send the signals in the right direction even if you don't specifically tell them what that direction is.

Are CUDA streams device-associated? And how do I get a stream's device?

I have a CUDA stream which someone handed to me - a cudaStream_t value. The CUDA Runtime API does not seem to indicate how I can obtain the index of the device with which this stream is associated.
Now, I know that cudaStream_t is just a pointer to a driver-level stream structure, but I'm hesitant to delve into the driver too much. Is there an idiomatic way to do this? Or some good reason not to want to do it?
Edit: Another aspect to this question is whether the stream really is associated with a device in a way in which the CUDA driver itself can determine that device's identity given the pointed-to structure.
Yes, streams are device-specific.
In CUDA, streams are specific to a context, and contexts are specific to a device.
Now, with the runtime API, you don't "see" contexts - you use just one context per device. But if you consider the driver API - you have:
CUresult cuStreamGetCtx ( CUstream hStream, CUcontext* pctx );
CUstream and cudaStream_t are the same thing - a pointer. So, you can get the context. Then, you set or push that context to be the current context (read about doing that elsewhere), and finally, you use:
CUresult cuCtxGetDevice ( CUdevice* device )
to get the current context's device.
So, a bit of a hassle, but quite doable.
My approach to easily determining a stream's device
My workaround for this issue is to have the (C++'ish) stream wrapper class keep (the context and) the device among the member variables, which means that you can write:
auto my_device = cuda::device::get(1);
auto my_stream = my_device.create_stream(); /* using some default param values here */
assert(my_stream.device() == my_device());
and not have to worry about it (+ it won't trigger the extra API calls since, at construction, we know what the current context is and what its device is).
Note: The above snippet is for a system with at least two CUDA devices, otherwise there is no device with index 1...
Regarding to the explicit streams, it is up to the implementation (to the best of my knowledge) there is no API providing this potential query capability to the users; I don't know about the capabilities that the driver can provide for you in this front, however, you can always query the stream.
By using cudaStreamQuery, you can query your targeted stream on your selected device, if it returns cudaSuccess or cudaErrorNotReady it means that the stream does exist on that device and if it returns cudaErrorInvalidResourceHandle, it means that it does not.

How to create a CUDA context?

How can I create a CUDA context?
The first call of CUDA is slow and I want to create the context before I launch my kernel.
The canonical way to force runtime API context establishment is to call cudaFree(0). If you have multiple devices, call cudaSetDevice() with the ID of the device you want to establish a context on, then cudaFree(0) to establish the context.
EDIT: Note that as of CUDA 5.0, it appears that the heuristics of context establishment are slightly different and cudaSetDevice() itself establishes context on the device is it called on. So the explicit cudaFree(0) call is no longer necessary (although it won't hurt anything).
Using the runtime API: cudaDeviceSynchronize, cudaDeviceGetLimit, or anything that actually accesses the context should work.
I'm quite certain you're not using the driver API, as it doesn't do that sort of lazy initialization, but for others' benefit the driver call would be cuCtxCreate.