Do cuDevicePrimaryCtxReset() and cudaDeviceReset() do the same thing? - cuda

Reading the CUDA Runtime API and Driver API docs, it seems that the two functions:
CUresult cuDevicePrimaryCtxReset ( CUdevice dev );
__host__ ​cudaError_t cudaDeviceReset ( void );
do the same thing (up to having to call cudaSetDevice(dev) before the runtime API call):
Destroy all allocations and reset all state on the primary context.
for the first and
Destroy all allocations and reset all state on the current device in the current process.
Do they, indeed, do the same? Or are there perhaps subtle differences that I'm missing or that aren't documented? e.g. something related to threads-vs-processes?

They're quite different.
Examining the program @RobertCrovella linked to, it seems that:
cuDevicePrimaryCtxReset() only destroys/resets the primary context, not touching other contexts.
cudaDeviceReset() destroys all contexts for the specified device, removing them from the context stack.
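A minimal probe along those lines, assuming a single device (a sketch only; error checking is omitted, and the printed value is a raw CUresult):

#include <cuda.h>
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);

    // An explicitly created (non-primary) context; it becomes current on this thread.
    CUcontext explicitCtx;
    cuCtxCreate(&explicitCtx, 0, dev);

    CUdeviceptr p;
    cuMemAlloc(&p, 1024);   // this allocation belongs to explicitCtx

    // Resets only the device's primary context; explicitCtx and its allocation
    // should survive, so the cuMemFree below is expected to succeed.
    cuDevicePrimaryCtxReset(dev);
    printf("cuMemFree after primary-context reset: %d\n", (int) cuMemFree(p));

    // Per the answer above, cudaDeviceReset() tears down the device's contexts
    // in this process, so handles obtained earlier are no longer usable afterwards.
    cudaDeviceReset();
    return 0;
}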

Related

How to implement handles for a CUDA driver API library?

Note: The question has been updated to address the questions that have been raised in the comments, and to emphasize that the core of the question is about the interdependencies between the Runtime and Driver APIs.
The CUDA runtime libraries (like CUBLAS or CUFFT) are generally using the concept of a "handle" that summarizes the state and context of such a library. The usage pattern is quite simple:
// Create a handle
cublasHandle_t handle;
cublasCreate(&handle);
// Call some functions, always passing in the handle as the first argument
cublasSscal(handle, ...);
// When done, destroy the handle
cublasDestroy(handle);
However, there are many subtle details about how these handles interoperate with driver and runtime contexts, multiple threads, and multiple devices. The documentation lists several scattered details about context handling:
The general description of contexts in the CUDA Programming Guide at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#context
The handling of multiple contexts, as described in the CUDA Best Practices Guide at http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#multiple-contexts
The context management differences between runtime and driver API, explained at http://docs.nvidia.com/cuda/cuda-driver-api/driver-vs-runtime-api.html
The general description of CUBLAS contexts/handles at http://docs.nvidia.com/cuda/cublas/index.html#cublas-context and their thread safety at http://docs.nvidia.com/cuda/cublas/index.html#thread-safety2
However, some of the information seems to be not entirely up to date (for example, I think one should use cuCtxSetCurrent instead of cuCtxPushCurrent and cuCtxPopCurrent?), some of it seems to be from a time before the "Primary Context" handling was exposed via the driver API, and some parts are oversimplified in that they only show the most simple usage patterns, make only vague or incomplete statements about multithreading, or cannot be applied to the concept of "handles" that is used in the runtime libraries.
My goal is to implement a runtime library that offers its own "handle" type, and that allows usage patterns that are equivalent to the other runtime libraries in terms of context handling and thread safety.
For the case that the library can internally be implemented solely using the Runtime API, things may be clear: context management is solely the user's responsibility. If they create their own driver context, the rules that are stated in the documentation about runtime and driver context management will apply. Otherwise, the Runtime API functions will take care of handling primary contexts.
However, there may be the case that a library will internally have to use the Driver API - for example, in order to load PTX files as CUmodule objects and obtain CUfunction objects from them. And when the library should - for the user - behave like a runtime library, but internally has to use the Driver API, some questions arise about how the context handling has to be implemented "under the hood".
What I have figured out so far is sketched here.
(It is "pseudocode" in that it omits the error checks and other details, and ... all this is supposed to be implemented in Java, but that should not be relevant here)
1. The "Handle" is basically a class/struct containing the following information:
class Handle
{
CUcontext context;
boolean usingPrimaryContext;
CUdevice device;
}
2. When it is created, two cases have to be covered: It can be created when a driver context is current for the calling thread. In this case, it should use this context. Otherwise, it should use the primary context of the current (runtime) device:
Handle createHandle()
{
cuInit(0);
// Obtain the current context
CUcontext context;
cuCtxGetCurrent(&context);
CUdevice device;
// If there is no context, use the primary context
boolean usingPrimaryContext = false;
if (context == nullptr)
{
usingPrimaryContext = true;
// Obtain the device that is currently selected via the runtime API
int deviceIndex;
cudaGetDevice(&deviceIndex);
// Obtain the device and its primary context
cuDeviceGet(&device, deviceIndex);
cuDevicePrimaryCtxRetain(&context, device);
cuCtxSetCurrent(context);
}
else
{
cuCtxGetDevice(&device);
}
// Create the actual handle. This might internally allocate
// memory or do other things that are specific for the context
// for which the handle is created
Handle handle = new Handle(device, context, usingPrimaryContext);
return handle;
}
3. When invoking a kernel of the library, the context of the associated handle is made current for the calling thread:
void someLibraryFunction(Handle handle)
{
cuCtxSetCurrent(handle.context);
callMyKernel(...);
}
Here, one could argue that the caller is responsible for making sure that the required context is current. But if the handle was created for a primary context, then this context will be made current automatically.
4. When the handle is destroyed, this means that cuDevicePrimaryCtxRelease has to be called, but only when the context is a primary context:
void destroyHandle(Handle handle)
{
if (handle.usingPrimaryContext)
{
cuDevicePrimaryCtxRelease(handle.device);
}
}
From my experiments so far, this seems to expose the same behavior as a CUBLAS handle, for example. But my possibilities for thoroughly testing this are limited, because I only have a single device, and thus cannot test the crucial cases, e.g. of having two contexts, one for each of two devices.
So my questions are:
Are there any established patterns for implementing such a "Handle"?
Are there any usage patterns (e.g. with multiple devices and one context per device) that could not be covered with the approach that is sketched above, but would be covered with the "handle" implementations of CUBLAS?
More generally: Are there any recommendations of how to improve the current "Handle" implementation?
Rhetorical: Is the source code of the CUBLAS handle handling available somewhere?
(I also had a look at the context handling in tensorflow, but I'm not sure whether one can derive recommendations about how to implement handles for a runtime library from that...)
(An "Update" has been removed here, because it was added in response to the comments, and should no longer be relevant)
I'm sorry I hadn't noticed this question sooner - as we might have collaborated on this somewhat. Also, it's not quite clear to me whether this question belongs here, on codereview.SX or on programmers.SX, but let's ignore all that.
I have now done what you were aiming to do, and possibly more generally. So, I can offer both an example of what to do with "handles", and moreover, suggest the prospect of not having to implement this at all.
The library is an expansion of cuda-api-wrappers to also cover the Driver API and NVRTC; it is not yet release-grade, but it is in the testing phase, on this branch.
Now, to answer your concrete question:
Pattern for writing a class surrounding a raw "handle"
Are there any established patterns for implementing such a "Handle"?
Yes. If you read:
What is the difference between: Handle, Pointer and Reference
you'll notice a handle is defined as an "opaque reference to an object". It has some similarity to a pointer. A relevant pattern, therefore, is a variation on the PIMPL idiom: In regular PIMPL, you write an implementation class, and the outwards-facing class only holds a pointer to the implementation class and forwards method calls to it. When you have an opaque handle to an opaque object in some third-party library or driver - you use the handle to forward method calls to that implementation.
That means that your outward-facing class is not a handle; it represents the object to which you have a handle.
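As a small illustration of that pattern (not the actual implementation of any library; just a sketch around the Runtime API's cudaStream_t, with error handling reduced to a thrown exception):

#include <cuda_runtime.h>
#include <stdexcept>

class Stream {
public:
    Stream()  { check(cudaStreamCreate(&handle_)); }
    ~Stream() { cudaStreamDestroy(handle_); }

    Stream(const Stream&) = delete;            // owning wrapper: non-copyable
    Stream& operator=(const Stream&) = delete;

    // The class represents the stream; methods forward to the API calls
    // that take the raw handle.
    void synchronize() { check(cudaStreamSynchronize(handle_)); }
    bool ready() const { return cudaStreamQuery(handle_) == cudaSuccess; }

    cudaStream_t raw() const { return handle_; }   // escape hatch for interop

private:
    static void check(cudaError_t e) {
        if (e != cudaSuccess) throw std::runtime_error(cudaGetErrorString(e));
    }
    cudaStream_t handle_;
};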
Generality and flexibility
Are there any usage patterns (e.g. with multiple devices and one context per device) that could not be covered with the approach that is sketched above, but would be covered with the "handle" implementations of CUBLAS?
I'm not sure what exactly CUBLAS does under the hood (and I have almost never used CUBLAS to be honest), but if it were well-designed and implemented, it would
create its own context, and try not to impinge on the rest of your code, i.e. it would always do:
Push our CUBLAS context onto the top of the stack
Do actual work
Pop the top of the context stack.
Your class doesn't do this.
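A sketch of that discipline, in the same pseudocode style as step 3 of the question (error checking omitted):

void someLibraryFunction(Handle handle)
{
    // Make the library's context current without disturbing the caller's stack
    cuCtxPushCurrent(handle.context);
    callMyKernel(...);          // do the actual work in that context
    CUcontext popped;
    cuCtxPopCurrent(&popped);   // pop our context; the caller's context is current again
}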
More generally: Are there any recommendations of how to improve the current "Handle" implementation?
Yes:
Use RAII whenever it is possible and relevant. If your creation code allocates a resource (e.g. via the CUDA driver) - the destructor for the object you return should safely release those resources (see the sketch after this list).
Allow for both reference-type and value-type use of handles, i.e. it may be the handle I created, but it may also be a handle I got from somewhere else and isn't my responsibility. This is trivial if you leave it up to the user to release resources, but a bit tricky if you take that responsibility yourself.
You assume that if there's any current context, that's the one your handle needs to use. Says who? At the very least, let the user pass a context in if they want to.
Avoid writing the low-level parts of this on your own unless you really must. You are quite likely to miss some things (the push-and-pop is not the only thing you might be missing), and you're repeating a lot of work that is actually generic and not specific to your application or library. I may be biased here, but you can now use nice, RAII-ish wrappers for CUDA contexts, streams, modules, devices etc. without even knowing about raw handles for anything.
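To make the RAII point above concrete, here is a minimal sketch for the primary-context case, with a flag distinguishing an owned (retained) context from a borrowed one. The class name and interface are illustrative only, not an existing API:

#include <cuda.h>

class ContextHold {
public:
    // Borrow an existing context: not our responsibility to release.
    explicit ContextHold(CUcontext ctx) : ctx_(ctx), owned_(false) {}

    // Retain the device's primary context: we must release it in the destructor.
    explicit ContextHold(CUdevice dev) : dev_(dev), owned_(true) {
        cuDevicePrimaryCtxRetain(&ctx_, dev_);
    }

    ~ContextHold() {
        if (owned_) cuDevicePrimaryCtxRelease(dev_);
    }

    CUcontext get() const { return ctx_; }

private:
    CUcontext ctx_ {};
    CUdevice  dev_ {};
    bool      owned_;
};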
Rhetorical: Is the source code of the CUBLAS handle handling available somewhere?
To the best of my knowledge, NVIDIA hasn't released it.

CUDA: calling a __host__ function() from a __global__ function() is not allowed [duplicate]

As the following error implies, calling a host function ('rand') is not allowed in kernel, and I wonder whether there is a solution for it if I do need to do that.
error: calling a host function("rand") from a __device__/__global__ function("xS_v1_cuda") is not allowed
Unfortunately, you cannot call functions in device code that are not marked with the __device__ modifier. If you need random numbers in device code, look at the CUDA random number generator cuRAND: http://developer.nvidia.com/curand
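For completeness, a minimal sketch of using the cuRAND device API in a kernel (launch configuration and error checking are left to the caller):

#include <curand_kernel.h>

__global__ void randomKernel(float *out, unsigned long long seed, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    curandState state;
    curand_init(seed, /*subsequence=*/i, /*offset=*/0, &state);  // per-thread RNG state
    out[i] = curand_uniform(&state);                             // uniform float in (0, 1]
}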
If you have your own host function that you want to call from a kernel use both the __host__ and __device__ modifiers on it:
__host__ __device__ int add( int a, int b )
{
return a + b;
}
When this file is compiled by the NVCC compiler driver, two versions of the function are compiled: one callable by host code and another callable by device code. That is why this function can now be called both by host and device code.
The short answer is that there is no solution to this issue.
Everything that normally runs on a CPU must be tailored for the CUDA environment, with no guarantee that it is even possible. Host functions are just another name in CUDA for ordinary C functions: functions running on a CPU/main-memory von Neumann architecture, as all C/C++ code on PCs has up to this point. GPUs give you tremendous amounts of computing power, but the cost is that they are not nearly as flexible or compatible. Most importantly, device functions run without the ability to access main memory, and the memory they can access is limited.
If what you are trying to get is a random number generator you are in luck considering that Nvidia went to the trouble of specifically implementing a highly efficient Mersenne Twister that can support up to 256 threads per SMP. It is callable inside a device function, described in an earlier post of mine here. If anyone finds a better link describing this functionality please remove mine and replace the appropriate text here along with the link.
One thing I am continually surprised by is how many programmers seem unaware of how standardized high quality pseudo-random number generators are. "Rolling your own" is really not a good idea considering how much of an art pseudo-random numbers are. Verifying a generator as providing acceptably unpredictable numbers takes a lot of work and academic talent...
While this is not applicable to rand(), a few host functions like printf are available when compiling for compute capability >= 2.0.
e.g:
nvcc.exe -gencode=arch=compute_10,code=\"sm_10,compute_10\" ...
error : calling a host function("printf") from a __device__/__global__ function("myKernel") is not allowed
Compiles and works with sm_20,compute_20
I have to disagree with some of the other answers in the following sense:
OP does not describe a problem: it is not unfortunate that you cannot call __host__ functions from device code - it is entirely impossible for it to be any other way, and that's not a bad thing.
To explain: Think of the host (CPU) code like a CD which you put into a CD player, and of the device code like, say, an SD card which you put into a miniature music player. OP's question is "how can I shove a disc into my miniature music player"? You can't, and it makes no sense to want to. It might be the same music essentially (code with the same functionality; although usually, host code and device code don't perform quite the same computational task) - but the media are not interchangeable.

Global device variables in CUDA: bad practice?

I am designing a library that has a large contingent of CUDA kernels to perform parallel computations. All the kernels will be acting on a common object, say a computational grid, which is defined using C++ style objects. The computational domain doesn't necessarily need to be accessed from the host side, so creating it on the device side and keeping it there makes sense for now. I'm wondering if the following is considered "good practice":
Suppose my computational grid class is called Domain. First, I define a global device-side variable to store the computational domain:
__device__ Domain* D;
Then I initialize the computational domain using a CUDA kernel:
__global__ void initDomain(paramType P){
D = new Domain(P);
}
Then, I perform computations using this domain with other kernels:
__global__ void doComputation(double *x,double *y){
D->doThing(x,y);
//...
}
If my domain remains fixed (i.e. kernels don't modify the domain once it's created), is this OK? Is there a better way? I initially tried creating the Domain object on the host side and copying it over to the device, but this turned out to be a hassle because Domain is a relatively complex type that makes it a pain to copy over using e.g. cudaMemcpy or even thrust::device_new (at least, I couldn't get it to work nicely).
Yes, it's OK.
Maybe you can improve performance using
__constant__
Using this keyword, your object will be available in all your kernels in very fast memory.
In order to copy your object, you must use cudaMemcpyToSymbol. Please note there are some restrictions: your object will be read-only in your device code, and it must not have a default constructor.
You can find more information here.
If your object is complex and hard to copy, maybe you can look at Unified Memory, then just pass your variable by value to your kernel.
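A sketch of the __constant__ route, assuming the domain can be reduced to a plain-old-data struct (no virtual functions, no dynamically allocated members); the struct and function names here are illustrative, and error checking is omitted:

#include <cuda_runtime.h>

struct DomainParams { int nx, ny; double dx, dy; };

__constant__ DomainParams d_domain;   // read-only on the device, served from the constant cache

__global__ void doComputation(double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < d_domain.nx) {
        // ... use d_domain.dx, d_domain.dy, x and y here ...
    }
}

void uploadDomain(const DomainParams &host)
{
    // Copy the host-side object into the __constant__ symbol before launching kernels
    cudaMemcpyToSymbol(d_domain, &host, sizeof(DomainParams));
}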

Are CUDA streams device-associated? And how do I get a stream's device?

I have a CUDA stream which someone handed to me - a cudaStream_t value. The CUDA Runtime API does not seem to indicate how I can obtain the index of the device with which this stream is associated.
Now, I know that cudaStream_t is just a pointer to a driver-level stream structure, but I'm hesitant to delve into the driver too much. Is there an idiomatic way to do this? Or some good reason not to want to do it?
Edit: Another aspect to this question is whether the stream really is associated with a device in a way in which the CUDA driver itself can determine that device's identity given the pointed-to structure.
Yes, streams are device-specific.
In CUDA, streams are specific to a context, and contexts are specific to a device.
Now, with the runtime API, you don't "see" contexts - you use just one context per device. But if you consider the driver API - you have:
CUresult cuStreamGetCtx ( CUstream hStream, CUcontext* pctx );
CUstream and cudaStream_t are the same thing - a pointer. So, you can get the context. Then, you set or push that context to be the current context (read about doing that elsewhere), and finally, you use:
CUresult cuCtxGetDevice ( CUdevice* device )
to get the current context's device.
So, a bit of a hassle, but quite doable.
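Putting those pieces together, a sketch of the procedure (assuming CUDA 9.2 or later, which is where cuStreamGetCtx appeared; error checking omitted):

#include <cuda.h>
#include <cuda_runtime.h>

CUdevice device_of(cudaStream_t stream)
{
    CUcontext ctx;
    cuStreamGetCtx((CUstream) stream, &ctx);  // the context the stream belongs to

    cuCtxPushCurrent(ctx);      // temporarily make that context current
    CUdevice dev;
    cuCtxGetDevice(&dev);       // device of the now-current context
    CUcontext popped;
    cuCtxPopCurrent(&popped);   // restore whatever was current before
    return dev;
}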
My approach to easily determining a stream's device
My workaround for this issue is to have the (C++'ish) stream wrapper class keep (the context and) the device among the member variables, which means that you can write:
auto my_device = cuda::device::get(1);
auto my_stream = my_device.create_stream(); /* using some default param values here */
assert(my_stream.device() == my_device());
and not have to worry about it (+ it won't trigger the extra API calls since, at construction, we know what the current context is and what its device is).
Note: The above snippet is for a system with at least two CUDA devices, otherwise there is no device with index 1...
Regarding explicit streams, it is up to the implementation; to the best of my knowledge, there is no API providing this query capability to users. I don't know what capabilities the driver can provide on this front; however, you can always query the stream.
By using cudaStreamQuery, you can query the targeted stream on your selected device: if it returns cudaSuccess or cudaErrorNotReady, the stream does exist on that device, and if it returns cudaErrorInvalidResourceHandle, it does not.
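A sketch of that probing loop (note that the exact error code returned for another device's stream is not something the documentation guarantees, so treat this as a heuristic rather than a supported query; error checking omitted):

#include <cuda_runtime.h>

int find_stream_device(cudaStream_t stream)
{
    int original, count;
    cudaGetDevice(&original);
    cudaGetDeviceCount(&count);
    int found = -1;
    for (int d = 0; d < count && found < 0; ++d) {
        cudaSetDevice(d);
        cudaError_t e = cudaStreamQuery(stream);
        if (e == cudaSuccess || e == cudaErrorNotReady)
            found = d;              // the stream appears to live on this device
        else
            cudaGetLastError();     // consume the error before trying the next device
    }
    cudaSetDevice(original);        // restore the caller's current device
    return found;
}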

Check context of given resource

Let's imagine a situation where I have a lot of initialized resources, for example streams, host and device memory, and events; some of them are initialized in the context of one GPU and the rest belong to the other GPU's context.
Is there a way to check if given resource (event, stream or memory) belongs to certain GPU context?
In some cases it would be worthwhile to assert such things before ordering a memory copy or kernel execution, rather than getting cudaErrorInvalidArgument afterwards.
I am not really aware of such an option in the CUDA API itself. It is just a low-level set of commands that you can issue to your GPU.
What I would do is wrap the CUDA API functions in some nice class which would track what is where and what is initialized. A class representing a GPU might be useful as well.
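A minimal sketch of that tracking idea, for device memory: a thin wrapper that records which device was current at allocation time, so the association can be asserted later. The class name and interface are purely illustrative:

#include <cuda_runtime.h>
#include <cassert>
#include <cstddef>

class TrackedDeviceBuffer {
public:
    explicit TrackedDeviceBuffer(std::size_t bytes) {
        cudaGetDevice(&device_);          // remember the owning device
        cudaMalloc(&ptr_, bytes);
    }
    ~TrackedDeviceBuffer() { cudaFree(ptr_); }

    TrackedDeviceBuffer(const TrackedDeviceBuffer&) = delete;
    TrackedDeviceBuffer& operator=(const TrackedDeviceBuffer&) = delete;

    // Assert before use that the caller has the matching device selected
    void* get() const {
        int current;
        cudaGetDevice(&current);
        assert(current == device_ && "buffer used with the wrong device current");
        return ptr_;
    }

    int device() const { return device_; }

private:
    void* ptr_  = nullptr;
    int device_ = -1;
};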