I am trying to get the profiling data for cuFFT library calls for example plan and exec. I am using nvprof (command line profiling tool), with option of "--print-api-trace". It prints the time for all the apis except the cuFFT apis. Is there a any flag i need to change to get the cuFFT profiling data ?
Or
I need to use the events and measure myself ??
According to the nvprof documentation, api-trace-mode:
API-trace mode shows the timeline of all CUDA runtime and driver API calls
cuFFT is neither the CUDA runtime API nor the CUDA driver API. It is a library of routines for FFT, whose documentation is here.
You can still use either nvprof, the command line profiler, or the visual profiler, to gather data about how cuFFT uses the GPU, of course.
Got it working.. Instead of using the nvprof i used the CUDA_PROFILE environment variable.
Related
I want to know, when the cuda code gets compiled? I mean is it possible to know the values of parameters of the cuda kernel which is given in the command line argument of host code run time? Is it possible to compile cuda code during run time of host code ?
In typical usage of the CUDA runtime API, CUDA device code gets compiled when you pass a file containing CUDA device code to nvcc, the CUDA compiler driver engine.
CUDA device code can/will be compiled at run-time using either the driver API or using the CUDA NVRTC mechanism. There is documentation for each of these approaches, CUDA sample codes for each of these approaches, and various questions here on the cuda SO tag for each.
When you use the CUDA driver API, the device source code you will present for compilation at run-time is in the form of PTX, a CUDA intermediate language.
For compilation of typical CUDA C++ device code at runtime, you would use the NVRTC mechanism.
I'm doing a simple experiment. Everyone may know about callback_metric sample code of CUPTI (located in CUPTI folder: /usr/local/cuda/extras/CUPTI/sample/callback_metric). It contains only a simple code for reading a metric when running a vectorAdd kernel. Everything works when I compile and run the code.
But when I run this code under nvprof command (nvprof ./callback_metric), I get an error message as:
Error: incompatible CUDA driver version
both nvprof and other CUPTI-based codes work fine separately.
The profilers are not intended to be used in this way with applications that make use of CUPTI.
This is documented in the profiler documentation:
Here are a couple of reasons why Visual Profiler may fail to gather metric or event information.
More than one tool is trying to access the GPU. To fix this issue please make sure only one tool is using the GPU at any given point. Tools include the CUDA command line profiler, Parallel NSight Analysis Tools and Graphics Tools, and applications that use either CUPTI or PerfKit API (NVPM) to read event values.
I am using the cuFFT library. How do I modify my code to see the function calls from this library (or any other CUDA library) in the NVIDIA Visual Profiler NVVP? I am using Windows and Visual Studio 2013.
Below is my code. I convert my image and filter to the Fourier domain, then perform point-wise complex matrix multiplication in a custom CUDA kernel I wrote, and then simply perform the inverse DFT on the filtered images spectrum. The results are accurate, but I am not able to figure out how to view the cuFFT functions in the profiler.
// Execute FFT Plans
cufftExecR2C(fftPlanFwd, (cufftReal *)d_in, (cufftComplex *)d_img_Spectrum);
cufftExecR2C(fftPlanFwd, (cufftReal *)d_filter, (cufftComplex *)d_filter_Spectrum);
// Perform complex pointwise muliplication on filter spectrum and image spectrum
pointWise_complex_matrix_mult_kernel << <grid, block >> >(d_img_Spectrum, d_filter_Spectrum, d_filtered_Spectrum, ROWS, COLS);
// Execute FFT^-1 Plan
cufftExecC2R(fftPlanInv, (cufftComplex *)d_filtered_Spectrum, (cufftReal *)d_out);
At the entry point to the library, the library call is like any other call into a C or C++ library: it is executing on the host. Within that library call, there may be calls to CUDA kernels or other CUDA API functions, for a CUDA GPU-enabled library such as CUFFT.
The profilers (at least up through CUDA 7.0 - see note about CUDA 7.5 nvprof below) don't natively support the profiling of host code. They are primarily focused on kernel calls and CUDA API calls. A call into a library like CUFFT by itself is not considered a CUDA API call.
You haven't shown a complete profiler output, but you should see the CUFFT library make CUDA kernel calls; these will show up in the profiler output. The first two CUFFT calls prior to your pointWise_complex_matrix_mult_kernel should have one or more kernel calls each that show up to the left of that kernel, and the last CUFFT call should have one or more kernel calls that show up to the right of that kernel.
One possible way to get specific sections of host code to show up in the profiler is to use the NVTX (NVIDIA Tools Extension) library to annotate your source code, which will cause those annotations to show up in the profiler output. You might want to put an NVTX range event around the library call you wish to see identified in the profiler output.
Another approach would be to try out the new CPU profiling features in nvprof in CUDA 7.5. You can refer to section 3.4 of the Profiler guide that ships with CUDA 7.5RC.
Finally, ordinary host profilers should be able to profile your CUDA application, including CUFFT library calls, but they won't have any visibility into what is happening on the GPU.
EDIT: Based on discussion in the comments below, your code appears to be similar to the simpleCUFFT sample code. When I compile and profile that code on Win7 x64, VS 2013 Community, and CUDA 7, I get the following output (zoomed in to depict the interesting part of the timeline):
You can see that there are CUFFT kernels being called both before and after the complex pointwise multiply and scale kernel that appears in that code. My suggestion would be to start by doing something similar with the simpleCUFFT sample code rather than your own code, and see if you can duplicate the output above. If so, the problem lies in your code (perhaps your CUFFT calls are failing, perhaps you need to add proper error checking, etc.)
Can anyone explain how the profiler works. How it measures all the time , instructions etc given the executable. I know how to run a profiler. I wanted to know its background working.
I want to develop a profiler of my own. So I need to understand how the existing profiler works.
I am provided with the executable and need to develop a profiler to profile the executable.
You can start by reading the CUPTI Documentation.
The CUDA Profiling Tools Interface (CUPTI) enables the creation of
profiling and tracing tools that target CUDA applications. CUPTI
provides four APIs: the Activity API, the Callback API, the Event API,
and the Metric API. Using these APIs, you can develop profiling tools
that give insight into the CPU and GPU behavior of CUDA applications.
CUPTI is delivered as a dynamic library on all platforms supported by
CUDA.
And CUPTI Metric API is what you should read, and you should always be aware of which CUDA version is your target, because some of the API are different than the previous or the next version.
I'm trying to use nvvp to profile opencl kernels.
I'm running ubuntu 12.04 64b with a GTX 580 and have verified the CUDA toolkit is working fine (i can run and profile cuda code).
When trying to debug my opencl code i get:
Warning: No CUDA application was profiled, exiting
Any hints?
Nvidia's visual profiler (nvvp) can be used to profile OpenCL programs, but it is more of a pain than profiling in CUDA directly.
Simon McIntosh's High Performance Computing group over at the University of Bristol came up with the original solution (here), and I can verify it works.
I'll summarise the basics:
Firstly, the environment variable COMPUTE_PROFILE must be set, this is done with COMPUTE_PROFILE=1
Secondly a COMPUTE_PROFILE_CONFIG must be provided, a sample I use (called nvvp.cfg) contains:
profilelogformat CSV
streamid
gpustarttimestamp
gpuendtimestamp
Next to perform the actual profiling, in this case I'll profile an OpenCL application called HuffFramework, using:
COMPUTE_PROFILE=1 COMPUTE_PROFILE_CONFIG=nvvp.cfg ./HuffFramework
This then generates a series of opencl_profile_*.log files, where * is the number of threads.
These log files can't be loaded by nvvp just yet as all kernel function symbols have a leading OPENCL_ instead of an expected CUDA_, thus replace these symbols with a quick script like so:
sed 's/OPENCL_/CUDA_/g' opencl_profile_0.log > cuda_profile_0.log
Finally cuda_profile_0.log can now be imported by nvvp, by starting nvvp and going File->Import...->Command-line Profiler, point it to cuda_profile_0.log and preso!
nvvp can only profile CUDA applications.