Dependency Analysis options in CUDA Profiler

I have implemented a single-GPU program that uses the cudaStreamWaitEvent() function to set a dependency between two streams using events.
To verify this dependency, is it possible to use the "Dependency Analysis" view in the NVIDIA Visual Profiler?
Also, what does each of the following options in the Dependency Analysis view provide?
Focus Critical Path
Highlight Execution Dependencies
Detailed information on these options doesn't seem to be available on the official NVIDIA website or elsewhere.

Yes, you should be able to use the Dependency Analysis feature to verify your usage of most CUDA synchronization APIs, including cudaStreamWaitEvent.
To use either of the two options you mention, you must first have the dependencies computed for your application trace. To do that in the NVIDIA Visual Profiler, select "Unguided Analysis" and then "Dependency Analysis".
You can then enable "Highlight Execution Dependencies": when you hover over or select an analyzed activity on the timeline, its incoming and outgoing dependencies are highlighted in red.
If you use cudaStreamWaitEvent to block one kernel until another kernel in an independent stream has finished, both will be highlighted in red as direct dependencies.
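For reference, here is a minimal sketch of the pattern being verified, assuming two placeholder kernels (kernelA, kernelB) and an arbitrary launch configuration:

    // kernelB in stream2 must not start until kernelA in stream1 has finished
    #include <cuda_runtime.h>

    __global__ void kernelA() { /* work in stream1 */ }
    __global__ void kernelB() { /* work that depends on kernelA */ }

    int main()
    {
        cudaStream_t stream1, stream2;
        cudaEvent_t  done;
        cudaStreamCreate(&stream1);
        cudaStreamCreate(&stream2);
        cudaEventCreate(&done);

        kernelA<<<64, 256, 0, stream1>>>();
        cudaEventRecord(done, stream1);         // mark kernelA's completion
        cudaStreamWaitEvent(stream2, done, 0);  // stream2 waits on the event
        kernelB<<<64, 256, 0, stream2>>>();     // runs only after kernelA

        cudaDeviceSynchronize();
        cudaEventDestroy(done);
        cudaStreamDestroy(stream1);
        cudaStreamDestroy(stream2);
        return 0;
    }

With a trace of such a program analyzed as described above, the dependency edge from kernelA (via the event) to kernelB is what gets highlighted.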

Related

Can not profile a cuda code with nvprof when using CUPTI functions inside

I'm doing a simple experiment. You may be familiar with the callback_metric sample code shipped with CUPTI (located in the CUPTI folder: /usr/local/cuda/extras/CUPTI/sample/callback_metric). It contains simple code for reading a metric while running a vectorAdd kernel. Everything works when I compile and run the code.
But when I run this code under nvprof command (nvprof ./callback_metric), I get an error message as:
Error: incompatible CUDA driver version
Both nvprof and other CUPTI-based programs work fine separately.
The profilers are not intended to be used in this way with applications that make use of CUPTI.
This is documented in the profiler documentation:
Here are a couple of reasons why Visual Profiler may fail to gather metric or event information.
More than one tool is trying to access the GPU. To fix this issue please make sure only one tool is using the GPU at any given point. Tools include the CUDA command line profiler, Parallel NSight Analysis Tools and Graphics Tools, and applications that use either CUPTI or PerfKit API (NVPM) to read event values.

What is a CUDA context?

Can anyone explain, or point me to a good source on, what a CUDA context is? I searched the CUDA developer guide and was not satisfied with it.
Any explanation or help would be great.
The CUDA API exposes the features of a stateful library: two consecutive calls relate to one another. In short, the context is its state.
The runtime API is a wrapper/helper around the driver API. In the driver API the context is made available explicitly, and you can have a stack of contexts for convenience. There is one specific context which is shared between the driver and runtime APIs (see the primary context).
The context holds all the management data needed to control and use the device. For instance, it holds the list of allocated memory, the loaded modules that contain device code, the mapping between CPU and GPU memory for zero copy, and so on.
Finally, note that this answer is based more on experience than on documentation.
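To make the driver-API side of this concrete, here is a minimal sketch (error handling omitted, allocation size arbitrary) showing explicit context creation and the per-thread context stack:

    #include <cuda.h>

    int main(void)
    {
        CUdevice    dev;
        CUcontext   ctx;
        CUdeviceptr dptr;

        cuInit(0);
        cuDeviceGet(&dev, 0);
        cuCtxCreate(&ctx, 0, dev);   /* context becomes current on this thread */

        cuMemAlloc(&dptr, 1024);     /* the allocation is owned by ctx */
        cuMemFree(dptr);

        cuCtxPopCurrent(&ctx);       /* remove it from this thread's stack */
        cuCtxPushCurrent(ctx);       /* make it current again */

        cuCtxDestroy(ctx);           /* releases everything the context owns */
        return 0;
    }

The runtime API hides all of this: it creates (or attaches to) the primary context on first use, which is why plain cudaMalloc/kernel-launch code never mentions a context.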
Essentially, it is a data structure that holds information relevant to maintaining a consistent state between the calls that you make, e.g. (open) (execute) (close).
This is so that the functions you invoke can send the signals in the right direction even if you don't specifically tell them what that direction is.

Can I profile OpenACC kernel in C source code level?

I'm trying to speed up my code with OpenACC using the PGI 15.7 compiler.
I want to profile my code at the C source level.
I'm using the 'nvvp' profiler from CUDA 7.0.
When I run nvvp, I can use the 'Analysis' tab to find out which latencies are slowing my code down (data dependency, conditional branch, bandwidth, etc.).
But I can only get kernel-level analysis, not line-based analysis
(e.g. the main_300_gpu kernel used 10 s),
so I have trouble knowing where in the code I have to make fixes.
Is there any way to profile my code at the source level?
I'm using
PGI 15.7 (using pgcc)
CUDA 7.0
NVIDIA GTX 960
Ubuntu 14.04 LTS x86_64
[my nvvp reporting screenshots]
You can also try adding the flag "-ta=tesla:lineinfo" to have the compiler add source-code association for the profiler (it is the equivalent of nvcc's -lineinfo flag). Though as Bob points out, the code may have been heavily transformed, so the line information may not directly correspond back to your original source.
At the current time (and on CUDA 7.5 or higher, with a cc5.2 or higher GPU), the nvvp profiler can associate various kinds of sampled execution activity with CUDA C/C++ lines of source code.
However, at the present time, this capability does not extend to OpenACC C/C++ (or Fortran) lines of source code.
It should still be possible to associate the activity with the disassembly, and it may be possible to associate it with the intermediate C source files produced by the PGI nollvm option. Neither of these will bear much resemblance to your OpenACC source code, however.
Another option for profiling OpenACC codes using PGI tools is to set the PGI_ACC_TIME=1 environment variable before executing your code. This will enable a lightweight profiler built into the runtime to give some analysis of the execution characteristics of your OpenACC code, in particular those parts associated with accelerator regions. The output is annotated so you can refer back to lines of your source code.
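As a concrete illustration, here is a minimal OpenACC example (the file name, the compile line, and the reported region name are assumptions based on the flags discussed above):

    /* saxpy.c -- compile, e.g.: pgcc -acc -ta=tesla:lineinfo saxpy.c -o saxpy
       then run with the built-in profiler: PGI_ACC_TIME=1 ./saxpy */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int n = 1 << 20;
        float *x = (float *)malloc(n * sizeof(float));
        float *y = (float *)malloc(n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        /* PGI_ACC_TIME reports this region by function name and line,
           e.g. "main_15_gpu", analogous to main_300_gpu above */
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = 2.0f * x[i] + y[i];

        printf("y[0] = %f\n", y[0]);
        free(x);
        free(y);
        return 0;
    }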

How does CUDA profiling work "under the hood"?

Can anyone explain how the profiler works: how it measures time, instruction counts, and so on, given only the executable? I know how to run a profiler; I want to understand how it works in the background.
I want to develop a profiler of my own, so I need to understand how the existing profilers work.
I am provided with an executable and need to develop a profiler to profile it.
You can start by reading the CUPTI Documentation.
The CUDA Profiling Tools Interface (CUPTI) enables the creation of
profiling and tracing tools that target CUDA applications. CUPTI
provides four APIs: the Activity API, the Callback API, the Event API,
and the Metric API. Using these APIs, you can develop profiling tools
that give insight into the CPU and GPU behavior of CUDA applications.
CUPTI is delivered as a dynamic library on all platforms supported by
CUDA.
The CUPTI Metric API is what you should read next. Always be aware of which CUDA version you are targeting, because parts of the API differ from one version to the next.
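As a sketch of what such a tool looks like, here is a minimal kernel tracer built on the CUPTI Activity API. The struct and enum names follow the CUPTI headers of roughly the CUDA 8-10 era (CUpti_ActivityKernel4 in particular is versioned, so check the cupti.h of your toolkit):

    #include <cupti.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* CUPTI asks us for a buffer to deposit activity records into */
    static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size,
                                         size_t *maxNumRecords)
    {
        *size = 1024 * 1024;
        *buffer = (uint8_t *)malloc(*size);
        *maxNumRecords = 0;            /* 0 = as many records as fit */
    }

    /* CUPTI hands back a filled buffer; walk the records in it */
    static void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId,
                                         uint8_t *buffer, size_t size,
                                         size_t validSize)
    {
        CUpti_Activity *record = NULL;
        while (cuptiActivityGetNextRecord(buffer, validSize, &record) ==
               CUPTI_SUCCESS) {
            if (record->kind == CUPTI_ACTIVITY_KIND_KERNEL) {
                CUpti_ActivityKernel4 *k = (CUpti_ActivityKernel4 *)record;
                printf("kernel %s: %llu ns\n", k->name,
                       (unsigned long long)(k->end - k->start));
            }
        }
        free(buffer);
    }

    int main(void)
    {
        cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted);
        cuptiActivityEnable(CUPTI_ACTIVITY_KIND_KERNEL);

        /* ... launch the CUDA work to be traced here ... */

        cuptiActivityFlushAll(0);      /* drain any buffered records */
        return 0;
    }

The GPU timestamps in each record are collected by the driver; this is the same mechanism nvprof and nvvp use to build their timelines.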

Dynamically detecting a CUDA enabled NVIDIA card and only then initializing the CUDA runtime: How to do?

I have an application with an algorithm that is accelerated with CUDA, plus a standard CPU implementation of it. We plan to release this application for various platforms, so most of the time there won't be an NVIDIA card available to run the accelerated CUDA code. What I want is to first check whether the user's system has a CUDA-enabled NVIDIA card and, only if it does, initialize the CUDA runtime. If the system does not support CUDA, I want to execute the CPU path. This question is very similar to mine, but I don't want to use any libraries other than the plain CUDA runtime. OpenCL is an alternative, but there isn't enough time to implement an OpenCL version of the algorithm for the first release. Without any check for CUDA's existence, the program will surely crash, since it can't find the needed .dlls for the CUDA runtime, and we surely don't want that. So I need advice on how to handle this initialization step.
Use the calls cudaGetDeviceCount and cudaGetDeviceProperties to find CUDA devices on the running system. First find out how many there are, then loop through all the available devices and inspect the properties to decide which ones qualify. What I mean by "qualify" depends on your application: do you want to require a certain compute capability, or a certain amount of memory? If there's more than one device, you might want to sort on some criterion, then set the device with cudaSetDevice. If there are no devices, or none that is sufficient, fall back on the CPU code path.
I'd also suggest having some mechanism to force CUDA mode off, in case some user's environment just doesn't work due to driver issues, or an old board, or something else. You can use a command-line option, or an environment variable, or whatever...
EDIT:
Regarding DLLs, you should package cudart[whatever].dll with your application. That will ensure that the program starts, and at least the CUDA query functions will operate.
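Putting the answer above together, a minimal sketch of the detection logic might look like the following (pickCudaDevice is a hypothetical helper, and the compute-capability and memory thresholds are arbitrary examples, not requirements):

    #include <cuda_runtime.h>
    #include <stdio.h>

    /* returns the index of a qualifying device, or -1 to use the CPU path */
    int pickCudaDevice(void)
    {
        int count = 0;
        if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0)
            return -1;                 /* no CUDA driver or no device */

        int best = -1;
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            if (cudaGetDeviceProperties(&prop, i) != cudaSuccess)
                continue;
            if (prop.major >= 2 && prop.totalGlobalMem >= (size_t)512 << 20)
                best = i;              /* remember a qualifying device */
        }
        return best;
    }

    int main(void)
    {
        int dev = pickCudaDevice();
        if (dev >= 0) {
            cudaSetDevice(dev);
            printf("Using CUDA device %d\n", dev);
            /* GPU code path */
        } else {
            printf("Falling back to the CPU path\n");
            /* CPU code path */
        }
        return 0;
    }

Note that even this check requires the CUDA runtime library to load at startup, which is why shipping the cudart DLL with the application, as described above, still matters.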