I am wondering whether cuDNN has a device-side API for dynamic parallelism (I want to call cuDNN kernels inside other kernels). I have found that cuBLAS has such an API (the cuBLAS device API), but I could not find any information on whether cuDNN provides a similar API. I checked the cuDNN paper and it says that "The library exposes a host-callable C language API", but the paper is old and I want to know whether a device-side API has since been added. I couldn't find any information in the cuDNN documentation. Is anyone aware of such an API?
There is no device API for cuDNN at this time. There is also currently no device API for cuBLAS in recent CUDA releases; that functionality was deprecated and removed in the CUDA 10 timeframe.
Related
Using the driver API precludes the usage of the runtime API in the same application ([1]). Unfortunately cuBLAS, cuFFT, etc. are all based on the runtime API. If one wants dynamic kernel definition as in cuModuleLoad and cuBLAS at the same time, what are the options? I have these in mind, but maybe there are more:
A. Wait for compute capability 3.5, which is rumored to support peaceful coexistence of the driver and runtime APIs in the same application.
B. Compile the kernels to an .so file and dlopen it. Do they get unloaded on dlclose?
C. Attempt to use cuModuleLoad from the driver api, but everything else from the runtime api. No idea if there is any hope for this.
I'm not holding my breath, because JCuda and PyCUDA are in pretty much the same bind, and they probably would have figured it out already.
[1] CUDA Driver API vs. CUDA runtime
To summarize, you are tilting at windmills here. By relying on extremely out of date information, you seem to have concluded that runtime and driver API interoperability isn't supported in CUDA, when, in fact, it has been since the CUDA 3.0 beta was released in 2009. Quoting from the release notes of that version:
The CUDA Toolkit 3.0 Beta is now available.
Highlights for this release include:
CUDA Driver / Runtime Buffer Interoperability, which allows applications using the CUDA Driver API to also use libraries implemented using the CUDA C Runtime.
There is documentation here which succinctly describes how the driver and runtime APIs interact.
To concretely answer your main question:
If one wants dynamic kernel definition as in cuModuleLoad and cuBLAS at the same time, what are the options?
The basic approach goes something like this:
Use the driver API to establish a context on the device as you would normally do.
Call the runtime API routine cudaSetDevice(). The runtime API will automagically bind to the existing driver API context. Note that device enumeration is identical and common between both APIs, so if you establish a context on a given device number with the driver API, the same number will select the same GPU in the runtime API.
You are now free to use any CUDA runtime API call or any library built on the CUDA runtime API. Behaviour is the same as if you relied on runtime API "lazy" context establishment; a minimal sketch follows this list.
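For illustration, here is a rough sketch of those three steps. All calls are real driver API, runtime API and cuBLAS entry points; error checking is omitted and the module file name "kernels.cubin" is just a placeholder, not something from the original question.

#include <cuda.h>            /* driver API */
#include <cuda_runtime.h>    /* runtime API */
#include <cublas_v2.h>       /* a library built on the runtime API */

int main(void)
{
    /* 1. Establish a context on the device with the driver API */
    CUdevice dev;
    CUcontext ctx;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* Dynamically load a module (PTX or cubin) with the driver API */
    CUmodule module;
    cuModuleLoad(&module, "kernels.cubin");   /* placeholder file name */

    /* 2. Bind the runtime API to the same device; it picks up the existing context */
    cudaSetDevice(0);

    /* 3. Libraries built on the runtime API now work in that context */
    cublasHandle_t handle;
    cublasCreate(&handle);

    /* ... use cuLaunchKernel on 'module' and cuBLAS calls side by side ... */

    cublasDestroy(handle);
    cuModuleUnload(module);
    cuCtxDestroy(ctx);
    return 0;
}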
Can anyone explain how the profiler works? How does it measure the time, instructions, etc., given only the executable? I know how to run a profiler; I want to understand how it works in the background.
I want to develop a profiler of my own. So I need to understand how the existing profiler works.
I am provided with the executable and need to develop a profiler to profile the executable.
You can start by reading the CUPTI Documentation.
The CUDA Profiling Tools Interface (CUPTI) enables the creation of profiling and tracing tools that target CUDA applications. CUPTI provides four APIs: the Activity API, the Callback API, the Event API, and the Metric API. Using these APIs, you can develop profiling tools that give insight into the CPU and GPU behavior of CUDA applications. CUPTI is delivered as a dynamic library on all platforms supported by CUDA.
The CUPTI Metric API is the part you should read most closely. Also keep in mind which CUDA version you are targeting, because parts of the API differ from one version to the next.
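To give a feel for how a tool collects timing, here is a minimal sketch using the CUPTI Activity API (link against cupti, e.g. -lcupti). It only records runtime API call durations; a real profiler enables more activity kinds (kernels, memcpys) and adds error checking. The buffer size is an arbitrary choice for this sketch.

#include <stdio.h>
#include <stdlib.h>
#include <cupti.h>

#define BUF_SIZE (8 * 1024 * 1024)

/* CUPTI asks us for a buffer to fill with activity records */
static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size, size_t *maxNumRecords)
{
    *size = BUF_SIZE;
    *buffer = (uint8_t *)malloc(BUF_SIZE);
    *maxNumRecords = 0;   /* 0 = fill the buffer with as many records as fit */
}

/* CUPTI hands back a filled buffer; walk the records and report durations */
static void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId,
                                     uint8_t *buffer, size_t size, size_t validSize)
{
    CUpti_Activity *record = NULL;
    while (cuptiActivityGetNextRecord(buffer, validSize, &record) == CUPTI_SUCCESS) {
        if (record->kind == CUPTI_ACTIVITY_KIND_RUNTIME) {
            CUpti_ActivityAPI *api = (CUpti_ActivityAPI *)record;
            printf("runtime API call (cbid %u) took %llu ns\n",
                   api->cbid, (unsigned long long)(api->end - api->start));
        }
    }
    free(buffer);
}

int main(void)
{
    /* Ask CUPTI to timestamp runtime API calls made by this process */
    cuptiActivityEnable(CUPTI_ACTIVITY_KIND_RUNTIME);
    cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted);

    /* ... run the CUDA code you want to profile here ... */

    cuptiActivityFlushAll(0);   /* deliver any records still buffered */
    return 0;
}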
Can anyone explain how to intercept calls to CUDA Runtime API?
I am a newbie and I have read a bit about Linux library interception.
I want to use the same concept to intercept the CUDA Runtime API.
The CUPTI SDK included in the CUDA Toolkit provides support for enabling callbacks on enter and exit of the CUDA runtime API. It is possible to do some modification to state in the callbacks but the current callback system does not allow you to modify the value of the parameters or to skip the real function.
If you need the ability to modify input and output parameters then I recommend you generate an interception layer. Doxygen perlmod and a fairly small perl script can be used to generate an interception layer.
I believe the ocelot source code has a full CUDA runtime interception layer.
On Linux you can use LD_PRELOAD to insert your interception layer into the application.
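For the LD_PRELOAD route, a minimal interposer for a single runtime call might look like the sketch below. This is only an illustration: it assumes the application links libcudart dynamically (nvcc links it statically by default, so the target may need to be rebuilt with -cudart shared), and the library/file names are placeholders.

/* intercept.c -- build: gcc -shared -fPIC intercept.c -o libintercept.so -ldl
 * run: LD_PRELOAD=./libintercept.so ./app
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <cuda_runtime_api.h>

cudaError_t cudaMalloc(void **devPtr, size_t size)
{
    /* Look up the real implementation in the next library in the search order */
    static cudaError_t (*real_cudaMalloc)(void **, size_t) = NULL;
    if (real_cudaMalloc == NULL)
        real_cudaMalloc = (cudaError_t (*)(void **, size_t))dlsym(RTLD_NEXT, "cudaMalloc");

    fprintf(stderr, "intercepted cudaMalloc(%zu bytes)\n", size);
    return real_cudaMalloc(devPtr, size);
}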
I am new to CUDA programming and don't know much about it. Can you please tell me what 'CUDA compute capability' means? When I used the following code on my university server, it showed me the following result.
for (device = 0; device < deviceCount; ++device)
{
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, device);
    printf("\nDevice %d has compute capability %d.%d.\n", device, deviceProp.major, deviceProp.minor);
}
RESULT:
Device 0 has compute capability 4199672.0.
Device 1 has compute capability 4199672.0.
Device 2 has compute capability 4199672.0.
.
.
cudaGetDeviceProperties returns the two fields major and minor. Can you please tell me what this 4199672.0 means?
The compute capability is the "feature set" (both hardware and software features) of the device. You may have heard the NVIDIA GPU architecture names "Tesla", "Fermi" or "Kepler". Each of those architectures has features that previous versions might not have.
In your CUDA toolkit installation folder on your hard drive, look for the file CUDA_C_Programming_Guide.pdf (or google it), and find Appendix F.1. It describes the differences in features between the different compute capabilities.
As @dialer mentioned, the compute capability is your CUDA device's set of computation-related features. As NVIDIA's CUDA API develops, the 'Compute Capability' number increases. At the time of writing, NVIDIA's newest GPUs are Compute Capability 3.5. You can get some details of what the differences mean by examining this table on Wikipedia.
As @aland suggests, your call probably failed, and what you're getting is the result of using an uninitialized variable. You should wrap your cudaGetDeviceProperties() call with some kind of error checking; see
What is the canonical way to check for errors using the CUDA runtime API?
for a discussion of the options for doing this.
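As an illustration of that advice, here is a minimal sketch wrapping the calls from your snippet in a simple error-checking macro. The macro name CUDA_CHECK is just a convention for this sketch, not part of the CUDA API.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Check every runtime API return code and stop with a readable message on failure */
#define CUDA_CHECK(call)                                                     \
    do {                                                                     \
        cudaError_t err_ = (call);                                           \
        if (err_ != cudaSuccess) {                                           \
            fprintf(stderr, "%s failed at %s:%d: %s\n",                      \
                    #call, __FILE__, __LINE__, cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                              \
        }                                                                    \
    } while (0)

int main(void)
{
    int deviceCount = 0;
    CUDA_CHECK(cudaGetDeviceCount(&deviceCount));

    for (int device = 0; device < deviceCount; ++device) {
        cudaDeviceProp deviceProp;
        CUDA_CHECK(cudaGetDeviceProperties(&deviceProp, device));
        printf("Device %d has compute capability %d.%d.\n",
               device, deviceProp.major, deviceProp.minor);
    }
    return 0;
}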
Is it possible to check whether any CUDA devices are present before any cudaMalloc... calls are made?
I'm using C++ and I just want to print an error message before the program launches in case the user's machine doesn't support CUDA.
EDIT: if I can check it from C#, that would be even better.
thanks!
You can use cudaGetDeviceCount to get the number of CUDA devices, and cudaGetDeviceProperties to retrieve the compute capabilities you need.
A rather old version of the API documentation for cudaGetDeviceCount can be found here.
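From C++, a minimal check might look like this sketch. It only uses cudaGetDeviceCount, so it should run on machines without a usable GPU as long as the CUDA runtime library itself is available.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);

    /* cudaGetDeviceCount fails (e.g. cudaErrorNoDevice) when no usable driver or device exists */
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA is not available: %s\n", cudaGetErrorString(err));
        return 1;
    }
    if (deviceCount == 0) {
        fprintf(stderr, "No CUDA-capable devices found.\n");
        return 1;
    }

    printf("Found %d CUDA device(s); it is now safe to call cudaMalloc and friends.\n", deviceCount);
    return 0;
}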