CUDA runtime API and dynamic kernel definition

Using the driver API precludes the usage of the runtime API in the same application [1]. Unfortunately cuBLAS, cuFFT, etc. are all based on the runtime API. If one wants dynamic kernel definition as in cuModuleLoad and cuBLAS at the same time, what are the options? I have these in mind, but maybe there are more:
A. Wait for compute capability 3.5 that's rumored to support peaceful coexistence of driver and runtime apis in the same application.
B. Compile the kernels to an .so file and dlopen it. Do they get unloaded on dlclose?
C. Attempt to use cuModuleLoad from the driver api, but everything else from the runtime api. No idea if there is any hope for this.
I'm not holding my breath, because jcuda or pycuda are in pretty much the same bind and they probably would have figured it out already.
[1] CUDA Driver API vs. CUDA runtime

To summarize, you are tilting at windmills here. By relying on extremely out of date information, you seem to have concluded that runtime and driver API interoperability isn't supported in CUDA, when, in fact, it has been since the CUDA 3.0 beta was released in 2009. Quoting from the release notes of that version:
The CUDA Toolkit 3.0 Beta is now available.
Highlights for this release include:
CUDA Driver / Runtime Buffer Interoperability, which allows applications using the CUDA Driver API to also use libraries implemented using the CUDA C Runtime.
There is documentation here which succinctly describes how the driver and runtime API interact.
To concretely answer your main question:
If one wants dynamic kernel definition as in cuModuleLoad and cublas
at the same time, what are the options?
The basic approach goes something like this:
Use the driver API to establish a context on the device as you would normally do.
Call the runtime API routine cudaSetDevice(). The runtime API will automagically bind to the existing driver API context. Note that device enumeration is identical and common between both APIs, so if you establish a context on a given device number in the driver API, the same number will select the same GPU in the runtime API.
You are now free to use any CUDA runtime API call or any library built on the CUDA runtime API. Behaviour is the same as if you relied on runtime API "lazy" context establishment.

Related

CUDA Driver API - minimum driver version?

I know that each CUDA toolkit has a minimum required driver. What I'm wondering is the following: suppose I load each driver API function (e.g. cuInit) as a function pointer via dlsym from libcuda.so. I use no runtime API, nor do I link against cudart. My kernel is built for a virtual architecture so that it is JIT-compiled at runtime (and the architecture is quite low, e.g. compute_30, so that I'm content with any Kepler-or-later device).
Does the minimum driver required restriction still apply in my case?
Yes, there is still a minimum driver version requirement.
The GPU driver has a CUDA version that it is designed to be compatible with. This can be discovered in a variety of ways, one of which is to run the deviceQuery (or deviceQueryDrv) sample code.
Therefore a particular GPU driver will have a "compatibility" associated with a particular CUDA version.
In order to run correctly, Driver API codes will require an installed GPU Driver that is compatible with (i.e. has a CUDA compatibility version equal to or greater than) the CUDA version that the Driver API code was compiled against.
The CUDA/GPU Driver compatibility relationships, and the concept of forward compatibility, are similar to what is described in this question/answer.
To extend/generalize the ("forward") compatibility relationship statement from the previous answer, newer GPU Driver versions are generally compatible with older CUDA codes, whether those codes were compiled against the CUDA Runtime or CUDA Driver APIs.
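For the dlsym-based setup described in the question, the check can be made explicitly at startup. What follows is a minimal sketch, not something taken from the answer: it assumes the driver library is installed as libcuda.so.1, declares the two driver entry points by hand instead of including cuda.h, and uses hypothetical helper names (query_driver_version, decode_cuda_version, driver_is_new_enough). cuDriverGetVersion reports the version encoded as 1000 * major + 10 * minor.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Hand-written stand-ins for the cuda.h prototypes (an assumption of this
   sketch): CUresult cuInit(unsigned int) and CUresult cuDriverGetVersion(int *).
   CUresult is an enum in cuda.h; CUDA_SUCCESS is 0. */
typedef int (*cu_init_fn)(unsigned int);
typedef int (*cu_driver_get_version_fn)(int *);

/* Split an encoded version (1000 * major + 10 * minor) into its parts. */
void decode_cuda_version(int encoded, int *major, int *minor)
{
    assert(major != NULL && minor != NULL);
    *major = encoded / 1000;
    *minor = (encoded % 1000) / 10;
}

/* Returns 1 if the driver's CUDA version is at least the required one. */
int driver_is_new_enough(int driver_version, int required_version)
{
    return driver_version >= required_version;
}

/* Returns the driver's encoded CUDA version, or -1 if it cannot be queried
   (no driver library, missing symbol, or a failing driver call). */
int query_driver_version(void)
{
    void *lib = dlopen("libcuda.so.1", RTLD_NOW);
    if (lib == NULL)
        return -1;

    cu_init_fn init = (cu_init_fn)dlsym(lib, "cuInit");
    cu_driver_get_version_fn get_version =
        (cu_driver_get_version_fn)dlsym(lib, "cuDriverGetVersion");

    int version = -1;
    if (init == NULL || get_version == NULL ||
        init(0) != 0 || get_version(&version) != 0)
        version = -1;

    /* The handle is deliberately not closed: the driver stays initialized
       and the application will keep using it. */
    return version;
}
```

A program built against CUDA 9.2, for example, would compare the result with 9020 and refuse to continue when driver_is_new_enough(query_driver_version(), 9020) returns 0. On older glibc versions the program has to be linked with -ldl.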

Kernel mode GPGPU usage

Is it possible to run CUDA or OpenCL applications from a Linux kernel module?
I have found a project which is providing this functionality, but it needs a userspace helper in order to run CUDA programs. (https://code.google.com/p/kgpu/)
While this project already avoids redundant memory copying between user and kernel space I am wondering if it is possible to avoid the userspace completely?
EDIT:
Let me expand my question. I am aware that kernel components can only call the API provided by the kernel and other kernel components. So I am not looking to call OpenCL or CUDA API directly.
CUDA or OpenCL API in the end has to call into the graphics driver in order to make its magic happen. Most probably this interface is completely non-standard, changing with every release and so on....
But suppose that you have a compiled OpenCL or CUDA kernel that you would want to run. Do the OpenCL/CUDA userspace libraries do some heavy lifting before actually running the kernel or are they just lightweight wrappers around the driver interface?
I am also aware that the userspace helper is probably the best bet for doing this since calling the driver directly would most likely get broken with a new driver release...
The short answer is, no you can't do this.
There is no way to call any code which relies on glibc from kernel space. That implies that there is no way of making CUDA or OpenCL API calls from kernel space, because those libraries rely on glibc and a host of other user space helper libraries and user space system APIs which are unavailable in kernel space. CUDA and OpenCL aren't unique in this respect -- it is why the whole of X11 runs in userspace, for example.
A userspace helper application working via a simple kernel module interface is the best you can do.
[EDIT]
The runtime components of OpenCL are not lightweight wrappers around a few syscalls to push a code payload onto the device. Amongst other things, they include a full just in time compilation toolchain (in fact that is all that OpenCL has supported until very recently), internal ELF code and object management and a bunch of other stuff. There is very little likelihood that you could emulate the interface and functionality from within a driver at runtime.

How does CUDA profiling work "under the hood"?

Can anyone explain how the profiler works? How does it measure time, instruction counts, etc., given only the executable? I know how to run a profiler; I want to know how it works internally.
I want to develop a profiler of my own. So I need to understand how the existing profiler works.
I am provided with the executable and need to develop a profiler to profile the executable.
You can start by reading the CUPTI Documentation.
The CUDA Profiling Tools Interface (CUPTI) enables the creation of
profiling and tracing tools that target CUDA applications. CUPTI
provides four APIs: the Activity API, the Callback API, the Event API,
and the Metric API. Using these APIs, you can develop profiling tools
that give insight into the CPU and GPU behavior of CUDA applications.
CUPTI is delivered as a dynamic library on all platforms supported by
CUDA.
The CUPTI Metric API is the part to read first. Always keep in mind which CUDA version you are targeting, because parts of the API differ between versions.

CUDA runtime api interception

Can anyone explain how to intercept calls to CUDA Runtime API?
I am a newbie and I have read a bit about linux library interception.
I want to use the same concept so that I can intercept Cuda Runtime Api.
The CUPTI SDK included in the CUDA Toolkit provides support for enabling callbacks on entry to and exit from CUDA runtime API functions. It is possible to modify some state in the callbacks, but the current callback system does not allow you to modify parameter values or to skip the real function.
If you need the ability to modify input and output parameters then I recommend you generate an interception layer. Doxygen perlmod and a fairly small perl script can be used to generate an interception layer.
I believe the ocelot source code has a full CUDA runtime interception layer.
On Linux you can use LD_PRELOAD to insert your interception layer into the application.
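As an illustration of the LD_PRELOAD approach, here is a minimal sketch of a hand-written interception layer for a single runtime function, cudaMalloc. It is not the generated layer mentioned above and not Ocelot's code. The return type is declared as int instead of cudaError_t so that the file compiles without cuda_runtime.h (an assumption of this sketch; cudaError_t is an enum and cudaSuccess is 0), and the counter name and the placeholder error value are hypothetical.

```c
#define _GNU_SOURCE              /* needed for RTLD_NEXT on glibc */
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>
#include <stdio.h>

/* Pointer type matching cudaError_t cudaMalloc(void **devPtr, size_t size),
   with int standing in for cudaError_t. */
typedef int (*cuda_malloc_fn)(void **, size_t);

/* Hypothetical placeholder returned when the real function is not found. */
enum { INTERCEPT_NO_REAL_SYMBOL = 999 };

/* Number of cudaMalloc calls seen by this layer. */
static unsigned long cuda_malloc_calls = 0;

/* Same name and parameters as the runtime function, so the dynamic linker
   resolves the application's calls here when this library is preloaded. */
int cudaMalloc(void **dev_ptr, size_t size)
{
    static cuda_malloc_fn real_cuda_malloc = NULL;

    assert(dev_ptr != NULL);
    cuda_malloc_calls++;
    fprintf(stderr, "[intercept] cudaMalloc(%zu bytes), call #%lu\n",
            size, cuda_malloc_calls);

    /* Look up the next definition of cudaMalloc in load order, which is
       the one in the shared CUDA runtime library. */
    if (real_cuda_malloc == NULL)
        real_cuda_malloc = (cuda_malloc_fn)dlsym(RTLD_NEXT, "cudaMalloc");
    if (real_cuda_malloc == NULL) {
        *dev_ptr = NULL;
        return INTERCEPT_NO_REAL_SYMBOL;
    }

    /* Arguments could be inspected or changed here before forwarding. */
    return real_cuda_malloc(dev_ptr, size);
}
```

Built with something like gcc -shared -fPIC intercept.c -o libintercept.so -ldl, the layer is activated with LD_PRELOAD=./libintercept.so ./app. This only has an effect if the application links the CUDA runtime dynamically; nvcc links it statically by default, so the application would need to be built with -cudart shared.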

How to use the context created by the runtime API from the driver API

A library that I link to uses the cuda runtime API. Thus it implicitly creates a cuda context when first calling a cuda function.
My code (that uses the library) should use the driver API. Now, how can I get both (runtime and driver API) to work at the same time?
The library calls the cudaSetDevice function upon library init. (There's no way I can change this.)
Can I somehow determine the context and tell the driver API to use that one?
cuCtxGetCurrent() gets the current context (which might have been created by the runtime).
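A minimal sketch of that query follows. To keep it compilable without cuda.h it loads the driver entry points from libcuda.so.1 by hand and treats CUcontext as a plain void pointer; both are assumptions of this sketch, and get_current_context is a hypothetical helper name. In code that includes cuda.h, the whole function reduces to a call to cuCtxGetCurrent(&ctx).

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Stand-ins for CUresult cuInit(unsigned int) and
   CUresult cuCtxGetCurrent(CUcontext *); CUcontext is an opaque pointer. */
typedef int (*cu_init_fn)(unsigned int);
typedef int (*cu_ctx_get_current_fn)(void **);

/* Returns  1 if a context is current on the calling thread (stored in *ctx_out),
            0 if the driver works but no context is current,
           -1 if the driver could not be loaded or a call failed. */
int get_current_context(void **ctx_out)
{
    assert(ctx_out != NULL);
    *ctx_out = NULL;

    void *lib = dlopen("libcuda.so.1", RTLD_NOW);
    if (lib == NULL)
        return -1;

    cu_init_fn init = (cu_init_fn)dlsym(lib, "cuInit");
    cu_ctx_get_current_fn get_current =
        (cu_ctx_get_current_fn)dlsym(lib, "cuCtxGetCurrent");
    if (init == NULL || get_current == NULL)
        return -1;

    /* cuInit may be called repeatedly; it is needed before other driver calls. */
    if (init(0) != 0 || get_current(ctx_out) != 0) {
        *ctx_out = NULL;
        return -1;
    }

    /* cuCtxGetCurrent succeeds with a NULL context when none is bound. */
    return (*ctx_out != NULL) ? 1 : 0;
}
```

The call has to happen on the thread the library used, and after the runtime has actually created its context. With lazy context creation, cudaSetDevice alone may not be enough for that, in which case the function returns 0 until the library has made a further runtime call.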