I cannot set a breakpoint in a CUDA kernel

I'm new to Nsight and CUDA. I tried to set a breakpoint inside my CUDA kernel code, but I can't: the breakpoint ends up at the end of my kernel rather than on the particular line I want to debug.
I'm using VS2010 (an MFC project) with Nsight 2.2 and CUDA 4.2.
I'm compiling in debug mode.
I'm using CUDA in a project which is not the Startup project.
I have "Generate Host Debug Information" set to "Yes (-g)".
I have "Generate Device Debug Information" set to "Yes (-G)".
I am currently running the program through Menu->Nsight->Start CUDA debugging.
When I try to set a breakpoint in a different project (which is the Startup project), I do succeed.
Any suggestions on how I can get the breakpoint to act on a particular line rather than on the entire kernel?

I used too many threads (256x256) to launch my kernel; 256x256 = 65536 threads per block, which is far more than the maximum allowed per block.
dim3 threads(256, 256);
kernel<<<..., threads>>>(...);
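For illustration, a minimal sketch of a configuration that stays within the per-block limit while still covering a 256x256 range (the kernel name and its elided arguments are placeholders carried over from the snippet above):
dim3 threads(16, 16);               // 256 threads per block, well under the per-block limit
dim3 blocks(16, 16);                // 16x16 blocks of 256 threads = 65536 threads in total
kernel<<<blocks, threads>>>(...);   // placeholder kernel and arguments, as in the question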

It is important to note that when debugging CUDA, breakpoints set in device code will not work properly if the number of cores on your machine is greater than the number of CUDA threads being run. Additionally, if the number of CUDA threads is not evenly divisible by the number of cores, some cores will not hit device code breakpoints on the last iteration.

Related

launch more than 65536 blocks per grid (x dimension) in CUDA

I have CUDA code that I am launching from a MEX file in Visual Studio. I am only launching blocks in the x dimension, but I get an error if I try to launch more than 65536 blocks, despite the fact that my compute capability is 6.1 (according to the GPU devices tab under system info).
Also under system info it says MAX_GRID_DIM_X is 2147483647. Is there some setting or environment variable I need to change before I can launch this many blocks? What other things might be limiting the number of blocks I can launch?
Is there some setting or environment variable I need to change before I can launch this many blocks?
No.
What other things might be limiting the number of blocks I can launch?
Compilation settings. You must choose a target compilation architecture that supports grid x-dimensions of up to 2^31-1 blocks. On CUDA 9, the default compilation architecture is compute capability 3.0, which supports the extended 1D grid size. On older toolkits, the default is 2.0 or older, and these do not.
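As a sketch (the source and output file names are placeholders, and a MEX build would need the equivalent architecture flag passed through its own options), compiling explicitly for the compute capability 6.1 device mentioned in the question enables the larger grid limit:
nvcc -arch=sm_61 -o myapp kernel.cu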

let Nsight start debugging after a certain kernel function has executed

My CUDA program has too many kernel functions, and if I turn on CUDA debugging mode I have to wait a whole hour before the breakpoint in a certain kernel function is triggered.
Is there any way for Nsight to start debugging only after certain kernel functions have executed, or to debug only a specific kernel function?
I'm using Nsight with VS2012
In theory you can follow the instructions in the Nsight help file (either the online help or the local help; at the time of writing the page is here).
In short:
In the Nsight Monitor options, CUDA » Use this Monitor for CUDA attach should be True.
Before starting your application, set an environment variable called NSIGHT_CUDA_DEBUGGER to 1.
Then in your CUDA kernel, you can add a breakpoint like this:
asm("brkpt;");
This works similarly to the __debugbreak() intrinsic or the int 3 assembly instruction in host code. When it is hit, you will get a dialog prompting you to attach the CUDA debugger.
In practice, at least for me it Just Doesn't Work™. Maybe you'll have more luck.
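For reference, a minimal sketch of what this looks like inside a kernel (the kernel name, the guard condition, and the data pointer are made up for illustration):
__global__ void myKernel(float *data, int n)   // hypothetical kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i == 0)
        asm("brkpt;");   // software breakpoint: when hit, Nsight offers to attach the CUDA debugger
    if (i < n)
        data[i] *= 2.0f;
}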

Kernel Conditional Breakpoints in Nsight Eclipse

I'm running CUDA 5.5 on a SUSE Linux machine with two M2050 cards installed, neither of which is used for running X11. I am trying to step through a kernel that specifically uses only device 0, using the Nsight Eclipse debugger. If I set an (unconditional) breakpoint inside a kernel, the debugger breaks on block 0/thread 0 first, and then if I continue execution it breaks again at the same point 5 or 6 times on seemingly random threads in different blocks before exiting the kernel and continuing to the next kernel. The program executes correctly in the kernel and the results are displayed correctly. The host code debugs without problems.
If I make the same breakpoint conditional, as outlined in this post:
using nsight to debug
I am seeing no difference in the behavior of the debugger. The condition on the breakpoint seems to be ignored, and the debugger breaks on 5 or 6 random threads before exiting the kernel. Neither of these behaviors makes much sense to me. I would think the unconditional breakpoint should break on thread 0 or on all threads, and that the conditional breakpoint should break only on the thread it is conditioned on. I've looked all over the NVIDIA documentation, Stack Overflow, etc., and seem to have exhausted my options at this point. I was wondering if anyone else has seen similar behavior or might have some pointers.
An unconditional breakpoint breaks for every new "batch" of threads arriving at the device. This is needed so you can explore all of your threads.
Because of some technical issues, conditional breakpoints should be set after you have broken in the kernel at least once. This will be fixed in CUDA Toolkit 6.0.
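For reference, the kind of condition meant here, expressed at the cuda-gdb console that Nsight Eclipse drives (the file name, line number, and indices are placeholders), and set only once you have already broken inside the kernel, looks something like this:
(cuda-gdb) break myKernel.cu:42 if blockIdx.x == 3 && threadIdx.x == 0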

Time between Kernel Launch and Kernel Execution

I'm trying to optimize my CUDA program using the Parallel Nsight 2.1 edition for VS 2010.
My program runs on a Windows 7 (32 bit) machine with a GTX 480 board. I have installed the CUDA 4.1 32 bit toolkit and the 301.32 driver.
One cycle in the program consists of a copy of host data to the device, execution of the kernels, and a copy of the results from the device back to the host.
As you can see in the picture of the profiler results below, the kernels run in four different streams. The kernels in each stream rely on the data copied to the device in 'Stream 2'. That's why the asyncMemcpy is synchronized with the CPU before the launch of the kernels in the different streams.
What irritates me in the picture is the big gap between the end of the first kernel launch (at 10.5778679285) and the beginning of the kernel execution (at 10.5781500). It takes around 300 µs to launch the kernel, which is a huge overhead in a processing cycle of less than 1 ms.
Furthermore there is no overlapping of kernel execution and the data copy of the results back to the host, which increases the overhead even more.
Are there any obvious reasons for this behavior?
There are three issues that I can identify from the trace.
Nsight CUDA Analysis adds about 1 µs per API call. You have both CUDA runtime and CUDA Driver API trace enabled. If you were to disable CUDA runtime trace I would guess that you would reduce the width by 50 µs.
Since you are on a GTX 480 on Windows 7 you are executing on the WDDM driver model. On WDDM the driver must make a kernel call to submit work, which introduces a lot of overhead. To reduce this overhead the CUDA driver buffers requests in an internal software queue and sends the requests to the driver when the queue is full or when it is flushed by a synchronize call. It is possible to use cudaEventQuery to force the driver to flush the work, but this can have other performance implications.
It appears you are submitting your work to streams in a depth-first manner. On compute capability 2.x and 3.0 devices you will get better results if you submit to streams in a breadth-first manner (a sketch follows at the end of this answer). In your case you may then see overlap between your kernels.
The timeline screenshot does not provide sufficient information for me to determine why the memory copies are starting only after completion of all of the kernels. Given the API call pattern, you should be able to see transfers starting after each stream completes its launches.
If you are waiting on all streams to complete it is likely faster to do a cudaDeviceSynchronize than 4 cudaStreamSynchronize calls.
The next version of Nsight will have additional features to help understand the SW queuing and the submission of work to the compute engine and the memory copy engine.
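To illustrate the breadth-first point above, a rough sketch (the kernel name, buffer names, sizes, and the four-stream setup are assumptions based on the description; the streams and buffers are assumed to have been created earlier):
// Breadth-first submission: issue the same stage across all streams
// before moving on to the next stage, instead of finishing one stream first.
for (int i = 0; i < 4; ++i)
    myKernel<<<grid, block, 0, stream[i]>>>(devBuf[i]);
for (int i = 0; i < 4; ++i)
    cudaMemcpyAsync(hostBuf[i], devBuf[i], size,
                    cudaMemcpyDeviceToHost, stream[i]);   // copies can overlap with kernels in other streams
cudaDeviceSynchronize();   // one global sync instead of four cudaStreamSynchronize calls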

HELP! CUDA kernel will no longer launch after using too much memory

I'm writing a program that requires the following kernel launch:
dim3 blocks(16,16,16); //grid dimensions
dim3 threads(32,32); //block dimensions
get_gaussian_responses<<<blocks,threads>>>(pDeviceIntegral,itgStepSize,pScaleSpace);
I forgot to free the pScaleSpace array at the end of the program, and then ran the program through the CUDA profiler, which runs it 15 times in succession, using up a lot of memory / causing a lot of fragmentation. Now whenever I run the program, the kernel doesn't even launch. If I look at the list of function calls recorded by the profiler, the kernel is not there. I realize this is a pretty stupid error, but I don't know what I can do at this point to get the program to run again. I have restarted my computer, but that did not help. If I reduce the dimensions of the kernel, it runs fine, but the current dimensions are well within the allowed maximum for my card.
Max threads per block: 1024
Max grid dimensions: 65535,65535,65535
Any suggestions appreciated, thanks in advance!
Try launching with a smaller number of threads. If that works, it means that each of your threads is doing a lot of work or using a lot of memory, so the maximum possible number of threads cannot practically be launched by CUDA on your hardware.
You may have to make your CUDA code more efficient to be able to launch more threads. You could try slicing your kernel into smaller pieces if it has complex logic inside it. Or get more powerful hardware.
If you compile your code like this:
nvcc -Xptxas="-v" [other compiler options]
the assembler will report the amount of local memory that the code requires. This can be a useful diagnostic to see what the memory footprint of the kernel is. There is also an API call, cudaThreadSetLimit, which can be used to control the amount of per-thread heap memory that a kernel will try to consume during execution.
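A hedged sketch of that call (the sizes here are arbitrary examples; on newer toolkits cudaDeviceSetLimit is the non-deprecated equivalent):
// limit the device-side malloc heap (shared by all threads) to 8 MB
cudaThreadSetLimit(cudaLimitMallocHeapSize, 8 * 1024 * 1024);
// limit the per-thread stack to 4 KB
cudaThreadSetLimit(cudaLimitStackSize, 4 * 1024);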
Recent toolkits ship with a utility called cuda-memcheck, which provides valgrind-like analysis of kernel memory accesses, including buffer overflows and illegal memory usage. It might be that your code is overflowing some memory somewhere and overwriting other parts of GPU memory, leaving the card in a parlous state.
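Typical usage is simply (the executable name is a placeholder):
cuda-memcheck ./myapp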
I got it! nVidia NSight 2.0 - which supposedly supports CUDA 4 - changed my CUDA_INC_PATH to use CUDA 3.2. No wonder it wouldn't let me allocate 1024 threads per block. All relief and jubilation aside, that is a really stupid and annoying bug considering I already had CUDA 4.0 RC2 installed.