I have a binary of my program that was generated with the nvcc compiler. I want to profile it with nvprof. I tried nvprof ./a.out and it shows the time spent in each function. While this is useful, I also want to see a timeline of my application. I could easily do this if I were building my project with Nsight, but unfortunately I can't do that. So, how can I invoke nvprof outside of Nsight in order to see the timeline of my application?
There are several ways you can see the timeline:
In Nsight, click the profile button after compiling;
Use the standalone GUI profiling tool nvvp that ships with CUDA, which can be launched with the following command line if /usr/local/cuda/bin (the default CUDA installation binary directory) is in your $PATH. You can then launch your a.out inside the nvvp GUI to profile it and display the timeline.
$ nvvp
Use the command-line tool nvprof with the -o option to generate a result file, which can be imported into Nsight and/or nvvp to display the timeline. The nvprof user manual provides more details.
$ nvprof -o profile.result ./a.out
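If you just want something recognizable to look at on the timeline, a minimal test case along the lines of the sketch below (the file name vecadd.cu and the kernel are purely illustrative) should show up in nvvp as separate host-to-device copy, kernel execution, and device-to-host copy ranges:

// vecadd.cu -- minimal illustrative example, names are hypothetical
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes), *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);   // H2D copy range on the timeline
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<(n + 255) / 256, 256>>>(da, db, dc, n);      // kernel range on the timeline
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);    // D2H copy range on the timeline

    printf("c[0] = %f\n", hc[0]);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}

Compile it with nvcc vecadd.cu -o a.out, run nvprof -o profile.result ./a.out as above, and import profile.result into nvvp to view the timeline.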
Related
I am using a remote machine, which has 2 GPUs, to execute a Python script that contains CUDA code. To find out where I can improve the performance of my code, I am trying to use nvprof.
I have set in my code that I only want to use one of the 2 GPUs on the remote machine; however, when calling nvprof --profile-child-processes ./myscript.py, a process with the same ID is started on each of the GPUs.
Is there any argument I can give nvprof in order to only use one GPU for the profiling?
As you have pointed out, you can use the CUDA profilers to profile Python code simply by having the profiler run the Python interpreter on your script:
nvprof python ./myscript.py
Regarding the GPUs being used, the CUDA environment variable CUDA_VISIBLE_DEVICES can be used to restrict the CUDA runtime API to use only certain GPUs. You can try it like this:
CUDA_VISIBLE_DEVICES="0" nvprof --profile-child-processes python ./myscript.py
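If you want to double-check that the mask is being honored, a small hedged snippet like the following (the file name check_devices.cu is just illustrative) should report exactly one visible device when run under CUDA_VISIBLE_DEVICES="0", even on a two-GPU machine:

// check_devices.cu -- illustrative check; build with: nvcc check_devices.cu -o check_devices
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // With CUDA_VISIBLE_DEVICES="0" this should print 1, because the runtime
    // only enumerates the devices listed in the mask.
    printf("visible CUDA devices: %d\n", count);
    return 0;
}

Run it as CUDA_VISIBLE_DEVICES="0" ./check_devices to confirm only one device is enumerated.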
nvprof is documented and also has command-line help via nvprof --help. Looking at the command-line help, I see a --devices switch which appears to limit at least some functions to using only particular GPUs. You could try it with:
nvprof --devices 0 --profile-child-processes python ./myscript.py
For newer GPUs, nvprof may not be the best profiler choice. You should be able to use Nsight Systems in a similar fashion, for example via:
nsys profile --stats=true python ....
Additional "newer" profiler resources are linked here.
CUDA kernels are launched with this syntax (at least in the runtime API)
mykernel<<<blocks, threads, shared_mem, stream>>>(args);
Is this implemented as a macro or is it special syntax that nvcc removes before handing host code off to gcc?
The nvcc preprocessing system eventually converts it to a sequence of CUDA runtime library calls before handing the code off to the host code compiler for compilation. The exact sequence of calls may change depending on CUDA version.
You can inspect the intermediate files using the --keep option to nvcc (and --verbose may help with understanding as well), and you can also see a trace of the API calls issued for a kernel launch using one of the profilers, e.g. nvprof --print-api-trace ...
---EDIT---
Just to make this answer more concise, nvcc directly modifies the host code to replace the <<<...>>> syntax before passing it off to the host compiler (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#offline-compilation)
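As a rough, hedged illustration of what that replacement amounts to (this is not the literal code nvcc emits, which goes through internal stubs and varies by toolkit version): a <<<...>>> launch is functionally equivalent to packing the kernel arguments into an array of pointers and calling the documented runtime API cudaLaunchKernel:

// Illustrative equivalence only -- not the code nvcc actually generates.
#include <cuda_runtime.h>

__global__ void mykernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

void launch_by_hand(int *data, int n, dim3 blocks, dim3 threads,
                    size_t shared_mem, cudaStream_t stream)
{
    // What mykernel<<<blocks, threads, shared_mem, stream>>>(data, n)
    // boils down to at the runtime-API level:
    void *args[] = { &data, &n };
    cudaLaunchKernel((const void *)mykernel, blocks, threads,
                     args, shared_mem, stream);
}

Correspondingly, nvprof --print-api-trace shows a cudaLaunchKernel entry (or cudaConfigureCall/cudaLaunch on older toolkits) for every triple-chevron launch.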
Since I did not have access to an NVIDIA card, I was using GPUOcelot to compile and run my programs. Because I had separated my CUDA kernel and my C++ program into two files (I was using C++11 features), I was doing the following to build and run my program.
nvcc -c my_kernel.cu -arch=sm_20
g++ -std=c++0x -c my_main.cpp
g++ my_kernel.o my_main.o -o latest_output.o `OcelotConfig -l`
I have recently been given access to a Windows box that has an NVIDIA card. I downloaded the CUDA Toolkit for Windows and MinGW g++. Now I run
nvcc -c my_kernel.cu -arch=sm_20
g++ -std=c++0x -c my_main.cpp
The nvcc call now instead of producing my_kernel.o produces my_kernel.obj. And when I try to link them and run using g++ as I did before
g++ my_kernel.obj my_main.o -o m
I get the following error:
my_kernel.obj: file not recognized: File format not recognized
collect2.exe: error: ld returned 1 status
Could you please help me resolve this problem? Thanks.
nvcc is a compiler wrapper that invokes the device compiler and the host compiler under the hood (it can also invoke the host linker, but you're using -c so not doing linking). On Windows, the supported host compiler is cl.exe from Visual Studio.
Linking two object files created with two different C++ compilers is typically not possible, even for CPU-only code, because the ABIs are different. The error message you are seeing is simply telling you that the object file format produced by cl.exe (via nvcc) is not compatible with g++.
You need to compile my_main.cpp with cl.exe, if that's producing errors then that's a different question!
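If it helps, one hedged way to set up the build (assuming you run these from a Visual Studio command prompt so that cl.exe is on the PATH, that your copy of cl.exe supports the C++11 features you use, and that nvcc performs the final link so the CUDA runtime library is pulled in automatically) would be:

nvcc -c my_kernel.cu -arch=sm_20
cl /EHsc /c my_main.cpp
nvcc my_kernel.obj my_main.obj -o latest_output.exe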
In order to compile a project for a Kepler K20 we need to set the -rdc=true flag. How can we set this flag in Nsight Eclipse Edition? I am using CUDA 5.
To enable separate compilation in Nsight Eclipse Edition:
Open project properties
In the left-hand tree, select "Build"/"CUDA"
Select "Separate Compilation" radio button at the top of the page.
Clean the project and rebuild.
This way Nsight will specify -rdc=true both to the compiler and linker.
For additional options like -rdc, you could add them directly in the command line of the compiler settings.
In your Nsight,
Project menu -> Properties -> Build -> Settings -> Tool Settings -> NVCC Compiler
append -rdc=true to either the "Command" string or the "Command line pattern" string.
You may also need to append this option to the NVCC linker command.
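For reference, the equivalent outside Nsight is simply to pass -rdc=true on the nvcc command line for the compile steps and let nvcc handle the device link, for example (the file names, and -arch=sm_35 for the K20, are illustrative):

nvcc -arch=sm_35 -rdc=true -c a.cu
nvcc -arch=sm_35 -rdc=true -c b.cu
nvcc -arch=sm_35 a.o b.o -o app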
With a very simple hello-world program, breakpoints are not working.
I can't quote the exact message since it isn't shown in English, but it says something like 'the symbols for this document are not loaded'.
There is no CUDA code, just a single printf line in the main function.
The working environment is Windows 7 64-bit, VC++ 2008 SP1, CUDA Toolkit 3.1 64-bit.
Please give me some explanation of this. :)
So this is just a host application (i.e. nothing to do with CUDA) doing printf that you can't debug? Have you selected "Debug" as the configuration instead of "Release"?
Are you trying to use a Visual Studio breakpoint to stop in your CUDA device code (.cu)? If that is the case, then I'm pretty sure you can't do that. NVIDIA has released Parallel Nsight, which should allow you to debug CUDA device code (.cu), though I don't have much experience with it myself.
Did you compile with -g -G options as noted in the documentation?
NVCC, the NVIDIA CUDA compiler driver, provides a mechanism for generating the debugging information necessary for CUDA-GDB to work properly. The -g -G option pair must be passed to NVCC when an application is compiled for ease of debugging with CUDA-GDB; for example,
nvcc -g -G foo.cu -o foo
here: https://docs.nvidia.com/cuda/cuda-gdb/index.html
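As a hedged example of what a debug build and a device-code breakpoint session look like with CUDA-GDB on Linux (foo.cu and mykernel stand in for your own file and kernel names); on Windows the analogous route is debugging through Parallel Nsight rather than a plain Visual Studio breakpoint:

nvcc -g -G foo.cu -o foo
cuda-gdb ./foo
(cuda-gdb) break mykernel
(cuda-gdb) run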