CUDA Driver API - minimum driver version?

I know that each CUDA toolkit has a minimum required driver. What I'm wondering is the following: suppose I load the function pointer for each driver API function (e.g. cuInit) via dlsym from libcuda.so. I don't use the runtime API, and I don't link against cudart. My kernel is built for a virtual architecture so that it is JIT-compiled at runtime (and the architecture is quite low, e.g. compute_30, so I'm content with any Kepler-or-newer device).
Does the minimum driver version requirement still apply in my case?

Yes, there is still a minimum driver version requirement.
The GPU driver has a CUDA version that it is designed to be compatible with. This can be discovered in a variety of ways, one of which is to run the deviceQuery (or deviceQueryDrv) sample code.
Therefore a particular GPU driver will have a "compatibility" associated with a particular CUDA version.
In order to run correctly, Driver API codes will require an installed GPU Driver that is compatible with (i.e. has a CUDA compatibility version equal to or greater than) the CUDA version that the Driver API code was compiled against.
The CUDA/GPU Driver compatibility relationships, and the concept of forward compatibility, are similar to what is described in this question/answer.
To extend/generalize the ("forward") compatibility relationship statement from the previous answer, newer GPU Driver versions are generally compatible with older CUDA codes, whether those codes were compiled against the CUDA Runtime or CUDA Driver APIs.
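As a rough illustration of that comparison, here is a minimal sketch using Python's ctypes in place of dlopen/dlsym. It assumes the driver library is installed as libcuda.so.1 and relies on cuDriverGetVersion, which reports the driver's CUDA version encoded as 1000*major + 10*minor. The helper names and the BUILT_AGAINST value are made up for the example.

```python
import ctypes


def decode_cuda_version(encoded):
    """Split CUDA's 1000*major + 10*minor encoding into (major, minor)."""
    return encoded // 1000, (encoded % 1000) // 10


def driver_is_new_enough(driver_version, built_against):
    """True when the driver's CUDA version is >= the version the code targets."""
    return driver_version >= built_against


def query_driver_cuda_version(library_name="libcuda.so.1"):
    """Return the driver's encoded CUDA version, or None if it cannot be read.

    library_name is an assumption; adjust it for your system.
    """
    try:
        libcuda = ctypes.CDLL(library_name)
        version = ctypes.c_int(0)
        # CUresult cuDriverGetVersion(int *driverVersion); 0 is CUDA_SUCCESS.
        if libcuda.cuDriverGetVersion(ctypes.byref(version)) != 0:
            return None
    except (OSError, AttributeError):
        return None
    return version.value


if __name__ == "__main__":
    BUILT_AGAINST = 8000  # hypothetical: code built against CUDA 8.0 headers
    installed = query_driver_cuda_version()
    if installed is None:
        print("no usable NVIDIA driver library found")
    else:
        print("driver supports CUDA %d.%d" % decode_cuda_version(installed))
        print("new enough:", driver_is_new_enough(installed, BUILT_AGAINST))
```

The same check can be done in C after resolving cuDriverGetVersion with dlsym; the comparison itself is just an integer comparison.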

Related

Benefit of higher version of CUDA for devices with lower Compute Capability

I'm using CUDA 7.0 on a Tesla K20X (C.C. 3.5). Is there any benefit to updating to a higher version of CUDA, say 8.0? Is there any compatibility or stability risk in using a higher version of CUDA with devices with (much) lower C.C.?
(Various available versions of CUDA on Nvidia website make me doubtful which one is really good)
Regarding benefits, newer CUDA toolkit versions usually provide feature benefits (new features and/or enhanced performance) over previous CUDA toolkit versions. However, there are also occasional performance regressions. Specifics can't be given, since they may vary based on your exact code. There are generally summary blog articles for each new CUDA toolkit version, for example here is the one for CUDA 8 and here is the one for CUDA 9, describing the new features available.
Regarding compatibility, there should be no risk to moving to a higher CUDA version, regardless of the compute capability of your device, as long as your device is supported. All current CUDA versions in the range of 7-9 support your cc3.5 GPU.
Regarding stability, it is possible that a newer CUDA version may have a bug, but it is also possible that a bug in your existing CUDA version may be fixed in a newer version. Guarantees can't be made here; software almost always has bugs in it. However, it is generally recommended to use the latest CUDA version compatible with your GPU (in the absence of other considerations), as this gives you access to the latest features and at least gives you the best chance that a historically known issue has been addressed.
I doubt these sorts of platitudes are any different regardless of the software stack (e.g. compiler, tools framework, etc.) that you are using. I don't think these considerations are specific or unique to CUDA.
I'm using CUDA 7.0 on a Tesla K20X (C.C. 3.5). Is there any benefit to update to a higher version of CUDA, say 8.0 ?
Are you kidding me? There are enormous benefits. It's a world of difference! Just have a look at the CUDA 8 feature descriptions (Parallel4All blog entry). Specifically,
CUDA 8.0 lets you compile with GCC 5.x instead of 4.x
That saves you the pain of having to build your own GCC 4.x, since modern distros often don't package it at all and it's not the system's default compiler. GCC 5.x also has lots of improvements, not the least of which is full C++14 support for host-side code.
CUDA 8 lets you use C++11 lambdas in device code
(actually, CUDA 7.5 lets you do that and this is rounded off in CUDA 8)
NVCC internal improvements
Not that I can list these, but hopefully NVIDIA continues working on its compiler, equipping it with better optimization logic.
Much faster compilation
NVCC is markedly faster with CUDA 8. It might be up to 2x, but even if it's just 1.5x - that really improves your quality of life as a developer...
Shall I go on? ... all of the above applies regardless of your compute capability. And CC 3.5 or 3.7 is nothing to sneeze at anyway.

cuda 7.0: maximum nvidia driver version

I have access to a computation server which uses an old version of the nvidia driver (346) and cuda (7.0) with applications depending on that specific version of cuda.
Is it possible to upgrade the driver and keep the old cuda?
I could find the minimum driver versions but not a maximum one.
CUDA generally doesn't enforce any maximum driver version.
Older CUDA toolkits are usable with newer drivers.
The only somewhat relevant point is that, from time to time, NVIDIA GPU architectures are deprecated, and this usually happens first at the driver level. That is, a particular GPU may be supported only up to a certain driver version, after which support ceases. These GPUs are then in "legacy" status.
So if your GPU is old enough, it will not be supported by the latest drivers. But if you currently have CUDA 7 running correctly, you must have at least a Fermi GPU, which is still supported by the latest drivers. However, Fermi is likely the next GPU family to go into legacy status at some point in the future.

Query whether CUDA device supports 32-bit or 64-bit addressing

I would like to discover, at runtime, whether a CUDA GPU supports 32-bit or 64-bit addressing. For context, I'm using LLVM to generate PTX at runtime, and need to know whether to set the target triple to nvptx or nvptx64.
There doesn't appear to be a direct query for this via cuDeviceGetAttribute, but is there some other query or heuristic that can give me this information?
64-bit addressing is a hard requirement for unified addressing to work. Also, all NVIDIA GPUs that are capable of 64-bit addressing support unified addressing. So testing whether unified addressing is supported for a given device also tells you whether 64-bit addressing is supported.
The unifiedAddressing field of struct cudaDeviceProp, queried with cudaGetDeviceProperties, gives that information.
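For what it's worth, the driver API exposes the same flag as a device attribute, CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING (value 41 in cuda.h), so the heuristic can be sketched without the runtime API. The following is only a sketch: it uses Python's ctypes, assumes the driver is installed as libcuda.so.1, and the function names are invented for the example. The mapping from the flag to the triple is the heuristic described above, not an official query.

```python
import ctypes

CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING = 41  # enumerator value from cuda.h


def triple_for(unified_addressing):
    """Pick the LLVM NVPTX architecture name from the unified-addressing flag."""
    return "nvptx64" if unified_addressing else "nvptx"


def query_unified_addressing(ordinal=0, library_name="libcuda.so.1"):
    """Return True/False for the device, or None if the query is not possible.

    library_name is an assumption; adjust it for your system.
    """
    try:
        libcuda = ctypes.CDLL(library_name)
    except OSError:
        return None
    device = ctypes.c_int(0)  # CUdevice is a plain int
    flag = ctypes.c_int(0)
    if libcuda.cuInit(0) != 0:
        return None
    if libcuda.cuDeviceGet(ctypes.byref(device), ordinal) != 0:
        return None
    # CUresult cuDeviceGetAttribute(int *pi, CUdevice_attribute attrib, CUdevice dev)
    status = libcuda.cuDeviceGetAttribute(
        ctypes.byref(flag), CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING, device)
    if status != 0:
        return None
    return bool(flag.value)


if __name__ == "__main__":
    supported = query_unified_addressing()
    if supported is None:
        print("could not query the device")
    else:
        print("target architecture:", triple_for(supported))
```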

restrict OpenCL access to Intel CPU?

It is currently possible to restrict OpenCL access to an NVIDIA GPU on Linux using the CUDA_VISIBLE_DEVICES env variable. Is anyone aware of a similar way to restrict OpenCL access to Intel CPU devices? (Motivation: I'm trying to force users of a compute server to run their OpenCL programs through SLURM exclusively.)
One possibility is to link directly to the Intel OpenCL library (libintelocl.so on my system) instead of going through the OpenCL ICD loader.
In pure OpenCL, the way to avoid assigning tasks to the CPU is to not select it (as platform or device). clGetDeviceIDs can do that using the device_type argument (don't set the CL_DEVICE_TYPE_CPU bit).
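To make the device_type idea concrete, here is a small sketch that counts only non-CPU devices by leaving the CL_DEVICE_TYPE_CPU bit out of the mask passed to clGetDeviceIDs. It uses Python's ctypes against the ICD loader, which is assumed to be installed as libOpenCL.so.1. The bit values come from cl.h; the function name is made up for the example.

```python
import ctypes

# cl_device_type bits from cl.h
CL_DEVICE_TYPE_CPU = 1 << 1
CL_DEVICE_TYPE_GPU = 1 << 2
CL_DEVICE_TYPE_ACCELERATOR = 1 << 3

# Everything we are willing to use: the CPU bit is deliberately left out.
NON_CPU_DEVICE_TYPES = CL_DEVICE_TYPE_GPU | CL_DEVICE_TYPE_ACCELERATOR


def count_non_cpu_devices(library_name="libOpenCL.so.1"):
    """Count GPU/accelerator devices over all platforms; None if no OpenCL library.

    library_name is an assumption; adjust it for your system.
    """
    try:
        opencl = ctypes.CDLL(library_name)
    except OSError:
        return None
    num_platforms = ctypes.c_uint(0)
    if opencl.clGetPlatformIDs(0, None, ctypes.byref(num_platforms)) != 0:
        return 0
    if num_platforms.value == 0:
        return 0
    platforms = (ctypes.c_void_p * num_platforms.value)()
    if opencl.clGetPlatformIDs(num_platforms.value, platforms, None) != 0:
        return 0
    total = 0
    for platform in platforms:
        count = ctypes.c_uint(0)
        # cl_device_type is a 64-bit bitfield, so pass it explicitly as such.
        status = opencl.clGetDeviceIDs(
            ctypes.c_void_p(platform), ctypes.c_ulonglong(NON_CPU_DEVICE_TYPES),
            0, None, ctypes.byref(count))
        if status == 0:  # a platform with no matching device returns an error
            total += count.value
    return total


if __name__ == "__main__":
    print("non-CPU OpenCL devices:", count_non_cpu_devices())
```

This only helps for programs you control, of course; it does nothing to stop other users' code from selecting the CPU.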
At the ICD level, I guess you could exclude the CPU driver if it's Intel's implementation; for AMD, it gets a little trickier since they have one driver for both platforms (it seems the CPU_MAX_COMPUTE_UNITS environment variable can restrict it to one core, but not disable it).
If the goal is to restrict OpenCL programs to running through a specific launcher, such as slurm, one way might be to add a group for that launcher and just make the OpenCL ICD vendor files in /etc/OpenCL (and possibly driver device nodes) usable only by that group.
None of this would prevent a user from having their own OpenCL implementation in place to run on CPU, but it could be enough to guide them to not run there by mistake.

cuda runtime api and dynamic kernel definition

Using the driver API precludes the use of the runtime API in the same application ([1]). Unfortunately cublas, cufft, etc. are all based on the runtime API. If one wants dynamic kernel definition as in cuModuleLoad and cublas at the same time, what are the options? I have these in mind, but maybe there are more:
A. Wait for compute capability 3.5, which is rumored to support peaceful coexistence of the driver and runtime APIs in the same application.
B. Compile the kernels to an .so file and dlopen it. Do they get unloaded on dlclose?
C. Attempt to use cuModuleLoad from the driver API, but everything else from the runtime API. No idea if there is any hope for this.
I'm not holding my breath, because jcuda or pycuda are in pretty much the same bind and they probably would have figured it out already.
[1] CUDA Driver API vs. CUDA runtime
To summarize, you are tilting at windmills here. By relying on extremely out of date information, you seem to have concluded that runtime and driver API interoperability isn't supported in CUDA, when, in fact, it has been since the CUDA 3.0 beta was released in 2009. Quoting from the release notes of that version:
The CUDA Toolkit 3.0 Beta is now available.
Highlights for this release include:
CUDA Driver / Runtime Buffer Interoperability, which allows applications using the CUDA Driver API to also use libraries implemented using the CUDA C Runtime.
There is documentation here which succinctly describes how the driver and runtime API interact.
To concretely answer your main question:
If one wants dynamic kernel definition as in cuModuleLoad and cublas at the same time, what are the options?
The basic approach goes something like this:
Use the driver API to establish a context on the device as you would normally do.
Call the runtime API routine cudaSetDevice(). The runtime API will automagically bind to the existing driver API context. Note that device enumeration is identical and common between both APIs, so if you establish a context on a given device number in the driver API, the same number will select the same GPU in the runtime API.
You are now free to use any CUDA runtime API call or any library built on the CUDA runtime API. Behaviour is the same as if you relied on runtime API "lazy" context establishment.
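The three steps above can be sketched roughly as follows, using Python's ctypes so that nothing needs to be compiled. Several things here are assumptions: the library names libcuda.so.1 and libcudart.so, the function name, and the use of the exported symbols cuCtxCreate_v2 and cuCtxDestroy_v2 (what cuda.h mapped cuCtxCreate and cuCtxDestroy to in toolkits of that generation). The function reports whether the driver context is still the current one after the runtime calls; which context the runtime ends up using may differ between toolkit versions, so treat the result as something to observe on your own setup.

```python
import ctypes


def interop_demo(ordinal=0, driver_name="libcuda.so.1",
                 runtime_name="libcudart.so"):
    """Run the three interop steps; None if libraries or calls are unavailable.

    Returns True when the driver-created context is still current afterwards.
    Library names are assumptions; adjust them for your system.
    """
    try:
        driver = ctypes.CDLL(driver_name)
        runtime = ctypes.CDLL(runtime_name)
    except OSError:
        return None

    device = ctypes.c_int(0)
    context = ctypes.c_void_p()

    # Step 1: establish a context with the driver API.
    if driver.cuInit(0) != 0:
        return None
    if driver.cuDeviceGet(ctypes.byref(device), ordinal) != 0:
        return None
    if driver.cuCtxCreate_v2(ctypes.byref(context), 0, device) != 0:
        return None

    try:
        # Step 2: select the same device number through the runtime API.
        if runtime.cudaSetDevice(ordinal) != 0:
            return None
        # Step 3: any runtime call is now allowed; cudaFree(NULL) is a
        # harmless one that makes sure the runtime is initialised.
        if runtime.cudaFree(None) != 0:
            return None
        current = ctypes.c_void_p()
        if driver.cuCtxGetCurrent(ctypes.byref(current)) != 0:
            return None
        return current.value == context.value
    finally:
        driver.cuCtxDestroy_v2(context)


if __name__ == "__main__":
    print("driver context still current:", interop_demo())
```

In a real application, step 3 is where calls into cublas or cufft would go, alongside modules loaded with cuModuleLoad.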