I am trying to write a program that runs almost entirely on the GPU (with very little interaction with the host). initKernel is the first kernel launched from the host. I use dynamic parallelism to launch successive kernels from initKernel, two of which are thrust::sort(thrust::device,...).
Before launching initKernel, I call cudaMalloc() in the host code, and that call is shown in the Runtime API row of the Visual Profiler. None of the cudaMalloc calls made in the __device__ functions and successive kernels (after the launch of initKernel) are shown there. Can someone help me understand why I cannot see the cudaMallocs in the Visual Profiler?
Thank you for your time.
Can someone help me understand why I cannot see the cudaMallocs in the Visual profiler?
Because it is a documented limitation of the tool. From the documentation:
The Visual Profiler timeline does not display CUDA API calls invoked from within device-launched kernels.
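To make the distinction concrete, here is a minimal sketch (not the asker's code) with one host-side cudaMalloc and one device-side cudaMalloc made inside a kernel. The names deviceMallocSucceeds and d_ok are hypothetical, and initKernel is reused only as a label. Under the limitation quoted above, only the host-side call would be expected to appear in the timeline. Device-side runtime calls need relocatable device code, so this assumes a build along the lines of nvcc -arch=sm_35 -rdc=true example.cu -lcudadevrt (or a newer architecture).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel that allocates scratch memory with the device-side runtime.
__global__ void initKernel(int n, int* ok) {
    int* scratch = nullptr;
    // Device-side API call: per the quoted limitation, not shown in the timeline.
    cudaError_t err = cudaMalloc((void**)&scratch, n * sizeof(int));
    if (err == cudaSuccess && scratch != nullptr) {
        scratch[0] = n;
        *ok = (scratch[0] == n) ? 1 : 0;
        cudaFree(scratch);
    } else {
        *ok = 0;
    }
}

// Hypothetical helper: returns true if the device-side allocation worked.
bool deviceMallocSucceeds(int n) {
    int* d_ok = nullptr;
    // Host-side API call: this one is visible in the Runtime API row.
    if (cudaMalloc((void**)&d_ok, sizeof(int)) != cudaSuccess) return false;
    initKernel<<<1, 1>>>(n, d_ok);
    int h_ok = 0;
    bool good = (cudaDeviceSynchronize() == cudaSuccess) &&
                (cudaMemcpy(&h_ok, d_ok, sizeof(int),
                            cudaMemcpyDeviceToHost) == cudaSuccess);
    cudaFree(d_ok);
    return good && h_ok == 1;
}

int main() {
    printf("device-side cudaMalloc %s\n",
           deviceMallocSucceeds(256) ? "succeeded" : "failed");
    cudaDeviceReset();
    return 0;
}
```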
I'm doing a simple experiment with the callback_metric sample code that ships with CUPTI (located in the CUPTI folder: /usr/local/cuda/extras/CUPTI/sample/callback_metric). It is a simple program that reads a metric while running a vectorAdd kernel. Everything works when I compile and run the code.
But when I run it under nvprof (nvprof ./callback_metric), I get this error message:
Error: incompatible CUDA driver version
Both nvprof and other CUPTI-based programs work fine separately.
The profilers are not intended to be used in this way with applications that make use of CUPTI.
This is documented in the profiler documentation:
Here are a couple of reasons why Visual Profiler may fail to gather metric or event information.
More than one tool is trying to access the GPU. To fix this issue please make sure only one tool is using the GPU at any given point. Tools include the CUDA command line profiler, Parallel NSight Analysis Tools and Graphics Tools, and applications that use either CUPTI or PerfKit API (NVPM) to read event values.
I am using the cuFFT library. How do I modify my code to see the function calls from this library (or any other CUDA library) in the NVIDIA Visual Profiler NVVP? I am using Windows and Visual Studio 2013.
Below is my code. I convert my image and filter to the Fourier domain, then perform point-wise complex matrix multiplication in a custom CUDA kernel I wrote, and then simply perform the inverse DFT on the filtered images spectrum. The results are accurate, but I am not able to figure out how to view the cuFFT functions in the profiler.
// Execute FFT Plans
cufftExecR2C(fftPlanFwd, (cufftReal *)d_in, (cufftComplex *)d_img_Spectrum);
cufftExecR2C(fftPlanFwd, (cufftReal *)d_filter, (cufftComplex *)d_filter_Spectrum);
// Perform complex pointwise multiplication on filter spectrum and image spectrum
pointWise_complex_matrix_mult_kernel<<<grid, block>>>(d_img_Spectrum, d_filter_Spectrum, d_filtered_Spectrum, ROWS, COLS);
// Execute FFT^-1 Plan
cufftExecC2R(fftPlanInv, (cufftComplex *)d_filtered_Spectrum, (cufftReal *)d_out);
At the entry point to the library, the library call is like any other call into a C or C++ library: it is executing on the host. Within that library call, there may be calls to CUDA kernels or other CUDA API functions, for a CUDA GPU-enabled library such as CUFFT.
The profilers (at least up through CUDA 7.0 - see note about CUDA 7.5 nvprof below) don't natively support the profiling of host code. They are primarily focused on kernel calls and CUDA API calls. A call into a library like CUFFT by itself is not considered a CUDA API call.
You haven't shown a complete profiler output, but you should see the CUFFT library make CUDA kernel calls; these will show up in the profiler output. The first two CUFFT calls prior to your pointWise_complex_matrix_mult_kernel should have one or more kernel calls each that show up to the left of that kernel, and the last CUFFT call should have one or more kernel calls that show up to the right of that kernel.
One possible way to get specific sections of host code to show up in the profiler is to use the NVTX (NVIDIA Tools Extension) library to annotate your source code, which will cause those annotations to show up in the profiler output. You might want to put an NVTX range event around the library call you wish to see identified in the profiler output.
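As a rough sketch of the NVTX suggestion, the following wraps a single cuFFT call in a named range. The helper name profileForwardFft, the range label "forward FFT" and the 1D plan are assumptions for illustration; the question's code uses its own plans and buffers. It assumes the legacy nvToolsExt.h header shipped with toolkits of that era (newer toolkits provide it as nvtx3/nvToolsExt.h) and linking against the NVTX and cuFFT libraries, e.g. -lnvToolsExt -lcufft on Linux.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>
#include <nvToolsExt.h>  // NVTX annotations (assumed legacy header location)

// Hypothetical helper: runs one R2C transform inside an NVTX range.
// Returns 0 on success, non-zero on any failure.
int profileForwardFft(int n) {
    cufftReal* d_in = nullptr;
    cufftComplex* d_spectrum = nullptr;
    if (cudaMalloc((void**)&d_in, sizeof(cufftReal) * n) != cudaSuccess) return 1;
    if (cudaMalloc((void**)&d_spectrum,
                   sizeof(cufftComplex) * (n / 2 + 1)) != cudaSuccess) {
        cudaFree(d_in);
        return 1;
    }
    cudaMemset(d_in, 0, sizeof(cufftReal) * n);

    cufftHandle plan;
    int status = 0;
    if (cufftPlan1d(&plan, n, CUFFT_R2C, 1) != CUFFT_SUCCESS) {
        status = 2;
    } else {
        nvtxRangePushA("forward FFT");  // range starts here on the host timeline
        if (cufftExecR2C(plan, d_in, d_spectrum) != CUFFT_SUCCESS) status = 3;
        // cuFFT work is asynchronous; synchronizing keeps the GPU work
        // inside the annotated range.
        if (cudaDeviceSynchronize() != cudaSuccess) status = 4;
        nvtxRangePop();                 // range ends here
        cufftDestroy(plan);
    }
    cudaFree(d_in);
    cudaFree(d_spectrum);
    return status;
}

int main() {
    int status = profileForwardFft(1024);
    printf("profileForwardFft returned %d\n", status);
    cudaDeviceReset();
    return status;
}
```

The range should then show up as a labelled bar in the profiler timeline, with the cuFFT kernels underneath it.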
Another approach would be to try out the new CPU profiling features in nvprof in CUDA 7.5. You can refer to section 3.4 of the Profiler guide that ships with CUDA 7.5RC.
Finally, ordinary host profilers should be able to profile your CUDA application, including CUFFT library calls, but they won't have any visibility into what is happening on the GPU.
EDIT: Based on discussion in the comments below, your code appears to be similar to the simpleCUFFT sample code. When I compile and profile that code on Win7 x64, VS 2013 Community, and CUDA 7, I get the following output (zoomed in to depict the interesting part of the timeline):
You can see that there are CUFFT kernels being called both before and after the complex pointwise multiply and scale kernel that appears in that code. My suggestion would be to start by doing something similar with the simpleCUFFT sample code rather than your own code, and see if you can duplicate the output above. If so, the problem lies in your code (perhaps your CUFFT calls are failing, perhaps you need to add proper error checking, etc.).
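For the error-checking point, a minimal sketch might look like the following. The CUDA_TRY and CUFFT_TRY macros and the forwardFftOk helper are hypothetical names, not part of cuFFT, and the 2D plan sizes stand in for the ROWS and COLS used in the question. If a cuFFT call fails here, it is reported instead of silently producing no kernels in the timeline.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

// Hypothetical macros: report the failing call and return false from the
// enclosing function. A sketch only; buffers are not freed on the error path.
#define CUDA_TRY(call)                                                    \
    do {                                                                  \
        cudaError_t e_ = (call);                                          \
        if (e_ != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",                 \
                    cudaGetErrorString(e_), __FILE__, __LINE__);          \
            return false;                                                 \
        }                                                                 \
    } while (0)

#define CUFFT_TRY(call)                                                   \
    do {                                                                  \
        cufftResult r_ = (call);                                          \
        if (r_ != CUFFT_SUCCESS) {                                        \
            fprintf(stderr, "cuFFT error %d at %s:%d\n",                  \
                    (int)r_, __FILE__, __LINE__);                         \
            return false;                                                 \
        }                                                                 \
    } while (0)

// Hypothetical helper: one checked 2D real-to-complex transform.
bool forwardFftOk(int rows, int cols) {
    cufftReal* d_in = nullptr;
    cufftComplex* d_spectrum = nullptr;
    size_t inBytes = sizeof(cufftReal) * rows * cols;
    size_t outBytes = sizeof(cufftComplex) * rows * (cols / 2 + 1);
    CUDA_TRY(cudaMalloc((void**)&d_in, inBytes));
    CUDA_TRY(cudaMalloc((void**)&d_spectrum, outBytes));
    CUDA_TRY(cudaMemset(d_in, 0, inBytes));

    cufftHandle plan;
    CUFFT_TRY(cufftPlan2d(&plan, rows, cols, CUFFT_R2C));
    CUFFT_TRY(cufftExecR2C(plan, d_in, d_spectrum));
    CUDA_TRY(cudaDeviceSynchronize());  // surfaces asynchronous failures

    CUFFT_TRY(cufftDestroy(plan));
    CUDA_TRY(cudaFree(d_in));
    CUDA_TRY(cudaFree(d_spectrum));
    return true;
}

int main() {
    bool ok = forwardFftOk(512, 512);
    printf("forward FFT %s\n", ok ? "succeeded" : "failed");
    cudaDeviceReset();
    return ok ? 0 : 1;
}
```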
My CUDA program has a large number of kernel functions, and when I enable CUDA debugging mode I have to wait a whole hour after the breakpoint in a certain kernel function is triggered.
Is there any way for Nsight to start debugging only after certain kernel functions, or to debug only one particular kernel function?
I'm using Nsight with VS2012.
In theory you can follow the instructions in the Nsight help file (either the online help or the local help; at the time of writing, the page is here).
In short:
In the Nsight Monitor options, CUDA » Use this Monitor for CUDA attach should be True.
Before starting your application, set an environment variable called NSIGHT_CUDA_DEBUGGER to 1.
Then in your CUDA kernel, you can add a breakpoint like this:
asm("brkpt;");
This works similarly to the __debugbreak() intrinsic or the int 3 assembly instruction in host code. When it is hit, you will get a dialog prompting you to attach the CUDA debugger.
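One way this could be arranged so that only the kernel of interest stops is sketched below. The kernel name kernelOfInterest, the helper runKernelOfInterest and the ENABLE_DEVICE_BREAKPOINT macro are all hypothetical; the macro guard and the use of asm volatile (so the compiler does not drop or move the instruction) are my additions rather than part of the Nsight instructions. The breakpoint is compiled in only when the macro is defined, and is restricted to a single thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the one you actually want to debug.
__global__ void kernelOfInterest(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
#ifdef ENABLE_DEVICE_BREAKPOINT
    // Only thread 0 of block 0 hits the breakpoint, and only in builds
    // compiled with -DENABLE_DEVICE_BREAKPOINT.
    if (i == 0) {
        asm volatile("brkpt;");
    }
#endif
    if (i < n) {
        data[i] += 1;
    }
}

// Hypothetical helper: launches the kernel on zeroed data and returns
// data[0], or -1 if any CUDA call fails.
int runKernelOfInterest(int n) {
    int* d_data = nullptr;
    if (cudaMalloc((void**)&d_data, n * sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(d_data, 0, n * sizeof(int));
    kernelOfInterest<<<(n + 255) / 256, 256>>>(d_data, n);
    int first = -1;
    if (cudaDeviceSynchronize() != cudaSuccess ||
        cudaMemcpy(&first, d_data, sizeof(int),
                   cudaMemcpyDeviceToHost) != cudaSuccess) {
        first = -1;
    }
    cudaFree(d_data);
    return first;
}

int main() {
    printf("data[0] = %d\n", runKernelOfInterest(1024));
    cudaDeviceReset();
    return 0;
}
```

Other kernels in the program are left untouched, so they run without stopping until this one is reached.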
In practice, at least for me it Just Doesn't Work™. Maybe you'll have more luck.
I wrote a simple CUDA program in a .cu file. To see the performance of this program, I choose "Nsight->Start Performance Analysis..." and then "Profile CUDA Application". After launching the application, letting it run for a while and finishing the capture, the report says "No kernel launches captured" and the summary report says "1 error encountered". Can someone help me figure out why this happened?
Do you call cudaDeviceSynchronize() or cudaDeviceReset() after all the CUDA work is done in your sample? Otherwise Nsight cannot guarantee that all the launch and memcpy record buffers are flushed.
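A minimal sketch of what that looks like at the end of a program is shown here; the kernel addOne and the helper runAndFlush are hypothetical names. The synchronize call waits for all outstanding GPU work, and cudaDeviceReset() is called last, just before the process exits.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: adds one to every element.
__global__ void addOne(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] += 1;
    }
}

// Hypothetical helper: runs the kernel on zeroed data and returns data[0],
// or -1 if any CUDA call fails.
int runAndFlush(int n) {
    int* d_data = nullptr;
    if (cudaMalloc((void**)&d_data, n * sizeof(int)) != cudaSuccess) return -1;
    cudaMemset(d_data, 0, n * sizeof(int));

    addOne<<<(n + 255) / 256, 256>>>(d_data, n);

    // Wait for all queued GPU work to finish before reading results or exiting.
    cudaError_t err = cudaDeviceSynchronize();
    int first = -1;
    if (err == cudaSuccess) {
        if (cudaMemcpy(&first, d_data, sizeof(int),
                       cudaMemcpyDeviceToHost) != cudaSuccess) {
            first = -1;
        }
    } else {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    }
    cudaFree(d_data);
    return first;
}

int main() {
    printf("data[0] = %d\n", runAndFlush(1 << 20));
    // Tear down the context explicitly as the last CUDA call so the
    // profiler has a chance to flush its launch and memcpy records.
    cudaDeviceReset();
    return 0;
}
```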
I'm trying to use nvvp to profile OpenCL kernels.
I'm running Ubuntu 12.04 64-bit with a GTX 580 and have verified that the CUDA toolkit is working fine (I can run and profile CUDA code).
When trying to profile my OpenCL code I get:
Warning: No CUDA application was profiled, exiting
Any hints?
Nvidia's visual profiler (nvvp) can be used to profile OpenCL programs, but it is more of a pain than profiling in CUDA directly.
Simon McIntosh's High Performance Computing group over at the University of Bristol came up with the original solution (here), and I can verify it works.
I'll summarise the basics:
Firstly, the environment variable COMPUTE_PROFILE must be set; this is done with COMPUTE_PROFILE=1.
Secondly, a COMPUTE_PROFILE_CONFIG file must be provided. The sample I use (called nvvp.cfg) contains:
profilelogformat CSV
streamid
gpustarttimestamp
gpuendtimestamp
Next, perform the actual profiling. In this case I'll profile an OpenCL application called HuffFramework, using:
COMPUTE_PROFILE=1 COMPUTE_PROFILE_CONFIG=nvvp.cfg ./HuffFramework
This then generates a series of opencl_profile_*.log files, where * is the number of threads.
These log files can't be loaded by nvvp just yet as all kernel function symbols have a leading OPENCL_ instead of an expected CUDA_, thus replace these symbols with a quick script like so:
sed 's/OPENCL_/CUDA_/g' opencl_profile_0.log > cuda_profile_0.log
Finally, cuda_profile_0.log can now be imported by nvvp: start nvvp, go to File->Import...->Command-line Profiler, point it to cuda_profile_0.log, and presto!
nvvp can only profile CUDA applications.