My CUDA program has too many kernel functions, and if I enable CUDA debugging mode I have to wait a whole hour for the breakpoint in a certain kernel function to be triggered.
Is there any way to make Nsight start debugging only after certain kernel functions have run, or to debug only a specific kernel function?
I'm using Nsight with VS2012.
In theory, you can follow the instructions in the Nsight help file (either the online help or the local help; at the time of writing, the relevant page is here).
In short:
In the Nsight Monitor options, CUDA » Use this Monitor for CUDA attach should be True.
Before starting your application, set an environment variable called NSIGHT_CUDA_DEBUGGER to 1.
Then in your CUDA kernel, you can add a breakpoint like this:
asm("brkpt;");
This works similarly to the __debugbreak() intrinsic or the int 3 assembly instruction in host code. When it is hit, you will get a dialog prompting you to attach the CUDA debugger.
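As a sketch of this approach, you can guard the brkpt instruction so that only the kernel (and thread) you care about stops, instead of every kernel in the program. The kernel name and the guard condition here are illustrative:

__global__ void suspectKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Break only in thread 0 of block 0 to avoid stopping every thread.
    if (blockIdx.x == 0 && threadIdx.x == 0)
        asm("brkpt;");   // prompts you to attach the CUDA debugger when hit
    if (i < n)
        data[i] *= 2.0f;
}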
In practice, at least for me it Just Doesn't Work™. Maybe you'll have more luck.
Related
I am trying to write a program that runs almost entirely on the GPU (with very little interaction with the host). initKernel is the first kernel launched from the host. I use dynamic parallelism to launch successive kernels from initKernel, two of which are thrust::sort(thrust::device, ...).
Before launching initKernel, I do a cudaMalloc() in the host code, and it shows up in the Runtime API row of the Visual Profiler. None of the cudaMallocs issued in the __device__ functions and the successive kernels (launched after initKernel) show up in the Runtime API row of the Visual Profiler. Can someone help me understand why I cannot see these cudaMallocs in the Visual Profiler?
Thank you for your time.
Can someone help me understand why I cannot see the cudaMallocs in the Visual profiler?
Because it is a documented limitation of the tool. From the documentation:
The Visual Profiler timeline does not display CUDA API calls invoked from within device-launched kernels.
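To illustrate the limitation, here is a minimal sketch (assuming a device of compute capability 3.5+ and compilation with -rdc=true for dynamic parallelism): the host-side cudaMalloc() appears on the Visual Profiler timeline, while the cudaMalloc() issued inside initKernel does not.

__global__ void childKernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

__global__ void initKernel(int n)
{
    // Device-side allocation and launch: not shown in the profiler's Runtime API rows.
    int *devData;
    cudaMalloc(&devData, n * sizeof(int));
    childKernel<<<(n + 255) / 256, 256>>>(devData, n);
    cudaDeviceSynchronize();
    cudaFree(devData);
}

int main()
{
    int *hostData;
    cudaMalloc(&hostData, 1024 * sizeof(int));   // visible in the Runtime API timeline
    initKernel<<<1, 1>>>(1024);                  // its internal API calls are not
    cudaDeviceSynchronize();
    cudaFree(hostData);
    return 0;
}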
I'm doing a simple experiment with the callback_metric sample code that ships with CUPTI (located in the CUPTI folder: /usr/local/cuda/extras/CUPTI/sample/callback_metric). It contains simple code for reading a metric while running a vectorAdd kernel, and everything works when I compile and run it.
But when I run this code under nvprof (nvprof ./callback_metric), I get the following error:
Error: incompatible CUDA driver version
Both nvprof and other CUPTI-based programs work fine when run separately.
The profilers are not intended to be used in this way with applications that make use of CUPTI.
This is documented in the profiler documentation:
Here are a couple of reasons why Visual Profiler may fail to gather metric or event information.
More than one tool is trying to access the GPU. To fix this issue please make sure only one tool is using the GPU at any given point. Tools include the CUDA command line profiler, Parallel NSight Analysis Tools and Graphics Tools, and applications that use either CUPTI or PerfKit API (NVPM) to read event values.
I wrote a simple CUDA program in a .cu file. To see how it performs, I chose "Nsight->Start Performance Analysis....", then chose "Profile CUDA Application". After the application ran for a while and the capture finished, the report said "No kernel launches captured" and the summary report said "1 error encountered". Can someone help me figure out why this happened?
Do you call cudaDeviceSynchronize() or cudaDeviceReset() after all the CUDA work is done in your sample? Otherwise Nsight cannot guarantee that all the launch and memcpy record buffers are flushed.
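As a minimal sketch of what this looks like (the kernel and sizes here are placeholders), the end of main() would be something like the following, so that the profiler's launch and memcpy records are flushed before the process exits:

__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *data;
    cudaMalloc(&data, n * sizeof(float));
    myKernel<<<(n + 255) / 256, 256>>>(data, n);

    cudaDeviceSynchronize();   // wait for all outstanding GPU work
    cudaFree(data);
    cudaDeviceReset();         // flush profiling buffers before exit
    return 0;
}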
I am trying to profile the CUDA rodinia benchmarks executing on a GTX 650.
I am using the code in /usr/local/cuda-5.0/extras/CUPTI/samples/event_sampling to read the instructions-executed counter. It seems strange that I do not see any change in the values reported by event_sampling whether the CUDA benchmarks are executing or not.
The event_sampling code also performs some calculations of its own, for which it measures the instructions executed. Unlike on the CPU, do I need to change the source code of the application in order to read GPU counters such as instructions executed?
CUPTI will only give you counter updates for kernels launched from the same process. You can, however, get some of these values (though not at the same level of precision) with the NVIDIA Visual Profiler or the related environment variables, without modifying the code.
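As a rough, untested sketch of what instrumenting your own process looks like (modeled on the CUPTI event API used by the samples; the event name "inst_executed", the collection mode, and the trivial kernel are assumptions and vary by GPU):

#include <cuda.h>
#include <cuda_runtime.h>
#include <cupti.h>
#include <stdint.h>
#include <stdio.h>

__global__ void busyKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;
}

int main()
{
    CUdevice dev;
    CUcontext ctx;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // Set up an event group that counts inst_executed for kernels
    // launched from this context.
    CUpti_EventGroup group;
    CUpti_EventID eventId;
    cuptiSetEventCollectionMode(ctx, CUPTI_EVENT_COLLECTION_MODE_KERNEL);
    cuptiEventGetIdFromName(dev, "inst_executed", &eventId);
    cuptiEventGroupCreate(ctx, &group, 0);
    cuptiEventGroupAddEvent(group, eventId);
    cuptiEventGroupEnable(group);

    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    busyKernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();

    // Read the accumulated counter value.
    uint64_t value = 0;
    size_t bytes = sizeof(value);
    cuptiEventGroupReadEvent(group, CUPTI_EVENT_READ_FLAG_NONE, eventId, &bytes, &value);
    printf("inst_executed = %llu\n", (unsigned long long)value);

    cuptiEventGroupDisable(group);
    cuptiEventGroupDestroy(group);
    cudaFree(d);
    cuCtxDestroy(ctx);
    return 0;
}

Error checking is omitted for brevity; the key point is that the kernel whose counters you want must be launched by the same process that reads them.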
I'm new to NSIGHT and CUDA. I tried to set a breakpoint inside my CUDA kernel code, but I can't--the breakpoint is set at the end of my kernel and not on the particular line I want to debug.
I'm using VS2010 (MFC project) with NSIGHT 2.2 and CUDA 4.2.
I'm compiling in debug mode.
I'm using CUDA in a project which is not the "Startup project".
I'm using "Generate Host Debug Information" with "Yes (-g)"
I'm using "Generate Device Debug Information" with "Yes (-G)"
I am currently running the program through Menu->Nsight->Start CUDA debugging.
When I try to set a breakpoint in a different project (which is the "Startup project"), I do succeed.
Any suggestions about how I can get the breakpoint to act on a particular line, versus the entire kernel?
It turned out I was launching my kernel with too many threads per block (256x256):
dim3 threads(256, 256);
kernel<<<..., threads>>>(...);
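The reason the breakpoint never worked is that 256x256 = 65,536 threads in a single block exceeds the maximum threads per block (1,024 on Fermi/Kepler, 512 on older devices), so the launch fails and the kernel never runs. Here is a sketch of a configuration that covers the same 256x256 domain with legal 16x16 blocks (the kernel body is only a placeholder):

__global__ void kernel(float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = x + y;   // a breakpoint here is now reachable
}

int main()
{
    const int width = 256, height = 256;
    float *out;
    cudaMalloc(&out, width * height * sizeof(float));

    dim3 threads(16, 16);                     // 256 threads per block, within the limit
    dim3 blocks(width / 16, height / 16);     // 16 x 16 blocks cover the 256x256 domain
    kernel<<<blocks, threads>>>(out, width, height);

    cudaError_t err = cudaGetLastError();     // an oversized block is reported here
    cudaDeviceSynchronize();
    cudaFree(out);
    return err == cudaSuccess ? 0 : 1;
}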
It is important to note that when debugging CUDA, breakpoints set in device code will not work properly if the number of cores on your machine is greater than the number of CUDA threads being run. Additionally, if the number of CUDA threads is not evenly divisible by the number of cores, some cores will not hit device code breakpoints on the last iteration.