How does CUDA profiling work "under the hood"? - cuda

Can anyone explain how the profiler works? How does it measure time, instruction counts, and so on, given only the executable? I know how to run a profiler; I want to understand how it works internally.
I want to develop a profiler of my own, so I need to understand how the existing one works.
I am given only the executable and need to develop a profiler that can profile it.

You can start by reading the CUPTI Documentation.
The CUDA Profiling Tools Interface (CUPTI) enables the creation of
profiling and tracing tools that target CUDA applications. CUPTI
provides four APIs: the Activity API, the Callback API, the Event API,
and the Metric API. Using these APIs, you can develop profiling tools
that give insight into the CPU and GPU behavior of CUDA applications.
CUPTI is delivered as a dynamic library on all platforms supported by
CUDA.
The CUPTI Metric API is the part you should read first. Always keep in mind which CUDA version you are targeting, because some of the APIs differ between versions.

Related

Can not profile a cuda code with nvprof when using CUPTI functions inside

I'm doing a simple experiment with the callback_metric sample code that ships with CUPTI (located in the CUPTI folder: /usr/local/cuda/extras/CUPTI/sample/callback_metric). It is a small program that reads a metric while running a vectorAdd kernel. Everything works when I compile and run the code on its own.
But when I run it under nvprof (nvprof ./callback_metric), I get this error message:
Error: incompatible CUDA driver version
Both nvprof and other CUPTI-based programs work fine separately.
The profilers are not intended to be used in this way with applications that make use of CUPTI.
This is documented in the profiler documentation:
Here are a couple of reasons why Visual Profiler may fail to gather metric or event information.
More than one tool is trying to access the GPU. To fix this issue please make sure only one tool is using the GPU at any given point. Tools include the CUDA command line profiler, Parallel NSight Analysis Tools and Graphics Tools, and applications that use either CUPTI or PerfKit API (NVPM) to read event values.

cuda runtime api and dynamic kernel definition

Using the driver API precludes the use of the runtime API in the same application ([1]). Unfortunately cuBLAS, cuFFT, etc. are all based on the runtime API. If one wants dynamic kernel definition as in cuModuleLoad and cuBLAS at the same time, what are the options? I have these in mind, but maybe there are more:
A. Wait for compute capability 3.5 that's rumored to support peaceful coexistence of driver and runtime apis in the same application.
B. Compile the kernels to an .so file and dlopen it. Do they get unloaded on dlclose?
C. Attempt to use cuModuleLoad from the driver api, but everything else from the runtime api. No idea if there is any hope for this.
I'm not holding my breath, because jcuda or pycuda are in pretty much the same bind and they probably would have figured it out already.
[1] CUDA Driver API vs. CUDA runtime
To summarize, you are tilting at windmills here. By relying on extremely out of date information, you seem to have concluded that runtime and driver API interoperability isn't supported in CUDA, when, in fact, it has been since the CUDA 3.0 beta was released in 2009. Quoting from the release notes of that version:
The CUDA Toolkit 3.0 Beta is now available.
Highlights for this release include:
CUDA Driver / Runtime Buffer Interoperability, which allows applications using the CUDA Driver API to also use libraries implemented using the CUDA C Runtime.
There is documentation here which succinctly describes how the driver and runtime API interact.
To concretely answer your main question:
If one wants dynamic kernel definition as in cuModuleLoad and cublas
at the same time, what are the options?
The basic approach goes something like this:
1. Use the driver API to establish a context on the device as you would normally do.
2. Call the runtime API routine cudaSetDevice(). The runtime API will automagically bind to the existing driver API context. Note that device enumeration is identical and common between both APIs, so if you establish a context on a given device number in the driver API, the same number will select the same GPU in the runtime API.
3. You are now free to use any CUDA runtime API call or any library built on the CUDA runtime API. Behaviour is the same as if you had relied on the runtime API's "lazy" context establishment.

How to read power consumption using CUPTI?

I know that there's a way to read the power consumption of a GPU using CUPTI. Which method can I use, and where can I find examples?
Probably what you are looking for is the CUPTI ActivityEnvironment data.
As far as I know, this particular data category is new in CUDA 5.5, so you may need to make sure you are using CUDA 5.5 or later to access these parameters.
Collecting this data is part of the CUPTI Activity API.
An example of the usage of this API is given in the activity_trace_async sample that is included with CUPTI in the CUDA Toolkit.
On a standard Linux install, this sample is located at /usr/local/cuda/extras/CUPTI/sample/activity_trace_async
The NVIDIA Management Library (NVML) is an API that allows you to read the power consumption of the GPU, among other information.
It has C and Python bindings.
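For the NVML route, a minimal sketch using the Python bindings might look like the following. It assumes the pynvml module is installed and that the GPU and driver support power reporting; neither is checked here. The helper names milliwatts_to_watts and read_gpu_power_watts are made up for illustration. NVML reports power draw in milliwatts, hence the conversion.

```python
def milliwatts_to_watts(milliwatts):
    """NVML reports power draw in milliwatts; convert to watts."""
    return milliwatts / 1000.0


def read_gpu_power_watts(device_index=0):
    """Return the current power draw of one GPU in watts.

    Assumption: the pynvml bindings are installed and the device
    supports power readings; NVML raises an error otherwise.
    """
    import pynvml  # imported lazily so the helper above works without a GPU

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        return milliwatts_to_watts(pynvml.nvmlDeviceGetPowerUsage(handle))
    finally:
        pynvml.nvmlShutdown()
```

Calling read_gpu_power_watts(0) in a loop from a separate thread or process while the kernel runs would give a coarse power trace; the sampling interval to use is left open here.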

CUDA runtime api interception

Can anyone explain how to intercept calls to the CUDA Runtime API?
I am a newbie and I have read a bit about Linux library interception.
I want to use the same concept to intercept the CUDA Runtime API.
The CUPTI SDK included in the CUDA Toolkit provides support for enabling callbacks on enter and exit of the CUDA runtime API. It is possible to do some modification to state in the callbacks but the current callback system does not allow you to modify the value of the parameters or to skip the real function.
If you need the ability to modify input and output parameters then I recommend you generate an interception layer. Doxygen perlmod and a fairly small perl script can be used to generate an interception layer.
I believe the ocelot source code has a full CUDA runtime interception layer.
On Linux you can use LD_PRELOAD to insert your interception layer into the application.
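The interception layer itself would be a shared library written in C, but the launch side can be sketched in a few lines. The following is only an illustration of setting LD_PRELOAD for a child process: the function names are made up, and any library path passed in is hypothetical.

```python
import os
import subprocess


def preload_env(library_path, base_env=None):
    """Return a copy of the environment with library_path first in LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # The dynamic loader accepts a colon-separated list; earlier entries win.
    env["LD_PRELOAD"] = library_path if not existing else library_path + ":" + existing
    return env


def run_with_shim(command, library_path):
    """Run command (a list of arguments) with the shim preloaded. Linux only."""
    return subprocess.run(command, env=preload_env(library_path)).returncode
```

Note that preloading can only interpose on symbols that are resolved dynamically. nvcc links the CUDA runtime statically by default, so the target application would likely need to be built against the shared runtime (for example with nvcc --cudart shared) for a runtime API shim to see any calls.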

cuFFT profiling issue

I am trying to get profiling data for cuFFT library calls, for example plan creation and exec. I am using nvprof (the command-line profiling tool) with the "--print-api-trace" option. It prints the time for all the APIs except the cuFFT APIs. Is there a flag I need to change to get the cuFFT profiling data, or do I need to use events and measure it myself?
According to the nvprof documentation, api-trace-mode:
API-trace mode shows the timeline of all CUDA runtime and driver API calls
cuFFT is neither the CUDA runtime API nor the CUDA driver API. It is a library of routines for FFT, whose documentation is here.
You can still use either nvprof, the command line profiler, or the visual profiler, to gather data about how cuFFT uses the GPU, of course.
Got it working. Instead of using nvprof, I used the CUDA_PROFILE environment variable.
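For anyone taking the same route: with the legacy command-line profiler, setting CUDA_PROFILE=1 makes the application write a text log, and the kernels that cuFFT launches show up there like any other kernel. Below is a rough sketch for summing GPU time per kernel from such a log. It assumes the non-CSV layout of "name=[ value ]" fields with gputime in microseconds, which is how I remember the format; check it against your own log. The function name and the sample values in the usage note are made up.

```python
import re
from collections import defaultdict

# Assumed line layout of the legacy (non-CSV) log, e.g.
#   method=[ memcpyHtoD ] gputime=[ 80.640 ] cputime=[ 278.000 ]
_FIELD = re.compile(r"(\w+)=\[\s*(.*?)\s*\]")


def gpu_time_by_method(log_text):
    """Sum the gputime field per method name across all log lines."""
    totals = defaultdict(float)
    for line in log_text.splitlines():
        if line.startswith("#"):  # header/comment lines carry no records
            continue
        fields = dict(_FIELD.findall(line))
        if "method" in fields and "gputime" in fields:
            totals[fields["method"]] += float(fields["gputime"])
    return dict(totals)
```

Given two illustrative lines with method=[ fft_kernel ] and gputime values of 5.000 and 7.000, the function returns a dictionary mapping fft_kernel to their sum. Lines that do not match the assumed layout are silently skipped.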