I am trying to get memory traces from cuda-gdb. However, I am not able to step into the kernel code. I use the nvcc flags -g -G and -keep, but to no effect. I am able to put a breakpoint on the kernel function, but when I try to step to the next instruction, it jumps to the end of the kernel function. I have tried this on the SDK examples and observe the same behaviour. I am working with the CUDA 5 toolkit. Any suggestions?
Thanks!
This behavior is typical of a kernel launch failure. Make sure you check the return codes of your CUDA calls. Note that for debugging you may want to add an additional call to cudaDeviceSynchronize immediately after the kernel call and check the return code from that call - it is the most precise way to obtain the cause of an asynchronous kernel launch failure.
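A minimal version of that checking pattern might look like the following sketch (the CHECK macro and the empty kernel are illustrative, not taken from the question's code):

```cuda
#include <cstdio>
#include <cstdlib>

// Illustrative helper: abort with a readable message if a CUDA call failed.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err = (call);                                      \
        if (err != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);      \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

__global__ void mykernel() { }

int main() {
    mykernel<<<1, 1>>>();
    CHECK(cudaGetLastError());       // catches launch-configuration errors
    CHECK(cudaDeviceSynchronize());  // catches asynchronous execution errors
    return 0;
}
```

cudaGetLastError catches errors in the launch itself (bad grid/block dimensions, missing device), while cudaDeviceSynchronize forces the host to wait and surfaces errors that occur while the kernel runs.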
Update: Code that runs outside the debugger but fails in cuda-gdb is most often explained by trying to debug on a single-GPU system from within a graphical environment. cuda-gdb cannot share a GPU with X, as this would hang the OS.
You need to exit the graphical environment (e.g. quit X) and debug from the console if your system has only one GPU.
If you have a multi-GPU system, check your X configuration (xorg.conf) so that it does not use the GPU you reserve for debugging.
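For instance, the Device section in xorg.conf can pin X to a specific card by PCI bus ID (the BusID below is an example value; find the one for your system with lspci | grep -i nvidia):

```
Section "Device"
    Identifier "Device0"
    Driver     "nvidia"
    BusID      "PCI:1:0:0"   # example bus ID; point X at the GPU it should keep
EndSection
```

With X bound to one GPU, cuda-gdb can then use the other one freely.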
I am using a remote machine, which has 2 GPUs, to execute a Python script containing CUDA code. To find where I can improve the performance of my code, I am trying to use nvprof.
I have set in my code that I only want to use one of the 2 GPUs on the remote machine; however, when calling nvprof --profile-child-processes ./myscript.py, a process with the same ID is started on each of the GPUs.
Is there any argument I can give nvprof so that it only uses one GPU for the profiling?
As you have pointed out, you can use the CUDA profilers to profile Python code simply by having the profiler run the Python interpreter with your script:
nvprof python ./myscript.py
Regarding the GPUs being used, the CUDA environment variable CUDA_VISIBLE_DEVICES can be used to restrict the CUDA runtime API to use only certain GPUs. You can try it like this:
CUDA_VISIBLE_DEVICES="0" nvprof --profile-child-processes python ./myscript.py
Also, nvprof is documented and has command-line help via nvprof --help. Looking at that help, I see a --devices switch that appears to limit at least some functions to particular GPUs. You could try it with:
nvprof --devices 0 --profile-child-processes python ./myscript.py
For newer GPUs, nvprof may not be the best profiler choice. You should be able to use Nsight Systems in a similar fashion, for example via:
nsys profile --stats=true python ....
CUDA kernels are launched with this syntax (at least in the runtime API):
mykernel<<<blocks, threads, shared_mem, stream>>>(args);
Is this implemented as a macro or is it special syntax that nvcc removes before handing host code off to gcc?
The nvcc preprocessing system eventually converts it to a sequence of CUDA runtime library calls before handing the code off to the host compiler for compilation. The exact sequence of calls may change with the CUDA version.
You can inspect the intermediate files using the --keep option to nvcc (--verbose may also help with understanding), and you can see a trace of the API calls issued for a kernel launch using one of the profilers, e.g. nvprof --print-api-trace ...
---EDIT---
Just to make this answer more precise: nvcc directly modifies the host code to replace the <<<...>>> syntax before passing it off to the host compiler (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#offline-compilation)
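As an illustration of the kind of call the triple-chevron syntax lowers to, a kernel can also be launched manually through the runtime API's cudaLaunchKernel. This is only a hedged sketch: the code nvcc actually generates also registers the kernel with the runtime and differs across CUDA versions.

```cuda
#include <cstdio>

__global__ void mykernel(int x) { printf("x = %d\n", x); }

int main() {
    int x = 42;
    void *args[] = { &x };  // one pointer per kernel argument

    // Roughly what mykernel<<<1, 1, 0, 0>>>(42) lowers to:
    cudaLaunchKernel((const void *)mykernel, dim3(1), dim3(1),
                     args, /*sharedMem=*/0, /*stream=*/0);
    cudaDeviceSynchronize();
    return 0;
}
```

Seeing the launch written this way also makes it clear why the trace from nvprof --print-api-trace shows explicit runtime API calls for each <<<...>>> launch.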
I have a kernel image for QEMU but don't know which machine it was configured to emulate (for example vexpress-a9 or virt).
How do I find the intended machine from the kernel image?
If you have a working QEMU command line that goes with the kernel, you can look at which -M/--machine option it uses. Otherwise, if you have the kernel .config, you can see which machines the kernel was built with support for. Failing that, you need to ask whoever built the kernel what they intended it to be used for, or just throw it away and get or build one that you know works.
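For example, something along these lines can narrow it down (the config symbols shown are common ARM examples, not an exhaustive list; extract-ikconfig only works if the kernel was built with CONFIG_IKCONFIG):

```shell
# If the kernel was built with CONFIG_IKCONFIG, the configuration can be
# extracted straight from the image with a script from the kernel source tree:
scripts/extract-ikconfig zImage > extracted.config

# Then look for the board/machine support that was compiled in:
grep -E 'CONFIG_ARCH_(VEXPRESS|VIRT|OMAP2PLUS)=y' extracted.config
```

A match such as CONFIG_ARCH_VEXPRESS=y suggests the kernel can boot on the corresponding QEMU machine (-M vexpress-a9 in that case).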
I've been writing some simple CUDA programs (I'm a student, so I need to practice), and the thing is I can compile one with nvcc from the terminal (using Kubuntu 12.04 LTS) and then execute it with optirun ./a.out (the hardware is a GeForce GT 525M in a Dell Inspiron) and everything works fine. The major problem is that I can't do anything from Nsight. When I try to start a debug version of the code, the message is "Launch failed! Binaries not found!". I think it's about running the command with optirun, but I'm not sure. Any similar experiences? Thanks for helping in advance, folks. :)
As this was the first post I found when searching for "nsight optirun", I just wanted to write down the steps I took to make it work for me:
Go to Run -> Debug Configurations -> Debugger
Find the textbox for CUDA GDB executable (in my case it was set to "${cuda_bin}/cuda-gdb")
Prepend "optirun --no-xorg"; in my case it was then "optirun --no-xorg ${cuda_bin}/cuda-gdb"
The "--no-xorg" option might not be required, or may even be counterproductive if you have an OpenGL window, as it prevents any such window from appearing. For my scientific code, however, it is required, as it keeps me from running into kernel timeouts.
Happy bug hunting.
We tested Nsight on Optimus systems without optirun - see "Install the cuda toolkit" in the CUDA Toolkit Getting Started guide for using the CUDA toolkit on an Optimus system. We have not tried optirun with Nsight EE.
If you still need to use optirun for debugging, you can try making a shell script that uses optirun to start cuda-gdb and set that shell script as cuda-gdb executable in the debug configuration properties.
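Such a wrapper could look like the following (the cuda-gdb path is an assumption; adjust it to your installation):

```shell
# Create a wrapper script that launches cuda-gdb through optirun.
cat > ~/optirun-cuda-gdb.sh <<'EOF'
#!/bin/sh
exec optirun --no-xorg /usr/local/cuda/bin/cuda-gdb "$@"
EOF
chmod +x ~/optirun-cuda-gdb.sh
```

Then point the "CUDA GDB executable" field in the debug configuration at ~/optirun-cuda-gdb.sh.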
The simplest thing to do is to run Eclipse with optirun; that will also run your app properly.
I'm pretty new to CUDA and flying a bit by the seat of my pants here...
I'm trying to debug my CUDA program on a remote machine I don't have admin rights on. I compile my program with nvcc -g -G and then try to debug it with cuda-gdb. However, as soon as gdb hits a call to a kernel (doesn't even have to enter it, and it doesn't happen in host code), I get:
(cuda-gdb) run
Starting program: /path/to/my/binary/cuda_clustered_tree
[Thread debugging using libthread_db enabled]
[1]+ Stopped cuda-gdb cuda_clustered_tree
cuda-gdb then dumps me back to my terminal. If I try to run cuda-gdb again, I get:
An instance of cuda-gdb (pid 4065) is already using device 0. If you believe
you are seeing this message in error, try deleting /tmp/cuda-dbg/cuda-gdb.lock.
The only way to recover is to kill -9 both cuda-gdb and cuda_clustered_ (I assume the latter is part of my binary).
This machine has two GPUs and is running CUDA 4.1 (I believe -- there were several versions installed, but that's the one I set PATH and LD_LIBRARY_PATH to), and it compiles and runs deviceQuery and bandwidthTest fine.
I can provide more info if need be. I've searched everywhere I could find online and found no help with this.
Figured it out! Turns out, cuda-gdb hates csh.
If you are running csh, it will cause cuda-gdb to exhibit the above anomalous behavior. Even when running bash from within csh and then running cuda-gdb, I still saw the behavior. You need to start your shell as bash, and only bash.
On the machine, the default shell was csh, but I use bash. I wasn't allowed to change it directly, so I added 'exec /bin/bash --login' to my .login script.
So even though I was running bash, because it was started by csh, cuda-gdb would exhibit the anomalous behavior above. Getting rid of the 'exec' command, so that I was running csh directly with nothing on top, still showed the behavior.
In the end, I had to get IT to change my shell to bash directly (after much patient troubleshooting by them.) Now it works as intended.