Profiling MPI+CUDA [duplicate]

This question already has answers here:
Multi-GPU profiling (Several CPUs, MPI/CUDA Hybrid)
(4 answers)
Closed 8 years ago.
I'm developing an MPI+CUDA project and I've tried to profile my app with both nvvp and nvprof. The app runs completely fine, but no profile is generated in either case.
nvprof mpirun -np 2 MPI_test
[...]
======== Warning: No CUDA application was profiled, exiting
I tried the simpleMPI CUDA sample with the same result.
I'm using CUDA 5.0 on a GTX 580 with Open MPI 1.7.3 (the feature series, not the stable release, because I'm testing the CUDA-aware option).
Any ideas? Thank you very much.

mpirun itself is not a CUDA application. You have to run the profiler as mpirun -np 2 nvprof MPI_test. But you also have to make sure that each instance of nvprof (two instances in that case) writes to a different output file. Open MPI exports the OMPI_COMM_WORLD_RANK environment variable, which gives the process rank in MPI_COMM_WORLD. This can be used in a small wrapper script, e.g. wrap_nvprof:
#!/bin/bash
# Each rank writes its own profile file, named after its MPI_COMM_WORLD rank
nvprof -o profile.$OMPI_COMM_WORLD_RANK "$@"
This should be run like mpirun -n 2 ./wrap_nvprof executable <arguments> and after it has finished there should be two output files with profile information: profile.0 for rank 0 and profile.1 for rank 1.
Edit: There is an example nvprof wrapper script in the nvvp documentation that does the same thing in a more graceful way and handles both Open MPI and MVAPICH2. A version of that script is reproduced in this answer to the question that yours is more or less a duplicate of.
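For reference, a rough sketch of that kind of wrapper (not the documented script itself; the output file naming is illustrative, and the rank variables are the ones Open MPI and MVAPICH2 export):
#!/bin/bash
# Pick up the process rank from whichever variable the MPI implementation exports
if [ -n "$OMPI_COMM_WORLD_RANK" ]; then
    RANK=$OMPI_COMM_WORLD_RANK        # Open MPI
elif [ -n "$MV2_COMM_WORLD_RANK" ]; then
    RANK=$MV2_COMM_WORLD_RANK         # MVAPICH2
else
    RANK=unknown
fi
# One profile per rank; pass the real application and its arguments through
exec nvprof -o profile.$RANK "$@"
As before, launch it via mpirun -n 2 ./wrap_nvprof executable <arguments>.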

Related

Profiling Pycuda GPU kernel? [duplicate]

I am using a remote machine that has 2 GPUs in order to execute a Python script containing CUDA code. To find where I can improve the performance of my code, I am trying to use nvprof.
I have set in my code that I only want to use one of the 2 GPUs on the remote machine, yet when calling nvprof --profile-child-processes ./myscript.py, a process with the same ID is started on each of the GPUs.
Is there any argument I can give nvprof so that only one GPU is used for the profiling?
As you have pointed out, you can use the CUDA profilers to profile Python code simply by having the profiler run the Python interpreter on your script:
nvprof python ./myscript.py
Regarding the GPUs being used, the CUDA environment variable CUDA_VISIBLE_DEVICES can be used to restrict the CUDA runtime API to use only certain GPUs. You can try it like this:
CUDA_VISIBLE_DEVICES="0" nvprof --profile-child-processes python ./myscript.py
Also, nvprof is documented and has command-line help via nvprof --help. Looking at that help, I see a --devices switch which appears to limit at least some functions to particular GPUs. You could try it with:
nvprof --devices 0 --profile-child-processes python ./myscript.py
For newer GPUs, nvprof may not be the best profiler choice. You should be able to use Nsight Systems in a similar fashion, for example via:
nsys profile --stats=true python ....
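To restrict such a run to a single GPU, the same environment variable approach should work; a sketch reusing the script name from the question:
# Profile only GPU 0 with Nsight Systems
CUDA_VISIBLE_DEVICES="0" nsys profile --stats=true python ./myscript.py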
Additional "newer" profiler resources are linked here.

nvprof is crashing as it writes a very large file to /tmp/ and runs out of disk space

How do I work around an nvprof crash that occurs when running on a disk with a relatively small amount of space available?
Specifically, when profiling my cuda kernel, I use the following two commands:
# Generate the timeline
nvprof -f -o ~/myproj/profiling/timeline-`date -I`.out ~/myproj/build/myexe
# Generate profiling data
nvprof -f --kernels ::mykernel:1 --analysis-metrics -o ~/myproj/profiling/analysis-metrics-`date -I`.out ~/myproj/build/myexe
The first nvprof command works fine. The second nvprof needs to write a 12GB temporary file to /tmp before it can proceed. Since my 38GB cloud disk only has 6 GB available, nvprof crashes. Assuming I can't free up more disk space, how do I work around this problem?
Side note:
It's mostly irrelevant to diagnosing the problem, but nvprof is reporting an Error: Application received signal 7, which is a "Bus error (bad memory access)" (see http://man7.org/linux/man-pages/man7/signal.7.html for more info).
You can direct nvprof to use a different temporary directory by setting the TMPDIR environment variable. This is helpful because, since Linux kernel 2.6, there is a decent chance that you have a RAM disk available at /dev/shm (see https://superuser.com/a/45509/363816 for more info). Thus, adding the following at the beginning of your [bash] script will likely work around the issue.
export TMPDIR=/dev/shm
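For example, for the analysis-metrics pass from the question, something along these lines should work, assuming /dev/shm exists and has enough free (RAM-backed) space for the roughly 12GB temporary file:
# Check how much space the RAM disk actually has before relying on it
df -h /dev/shm
# Send nvprof's temporary files to the tmpfs mount instead of /tmp
export TMPDIR=/dev/shm
nvprof -f --kernels ::mykernel:1 --analysis-metrics -o ~/myproj/profiling/analysis-metrics-`date -I`.out ~/myproj/build/myexe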

nvprof on a binary file

I have a binary of my program, generated with the nvcc compiler, and I want to profile it with nvprof. I tried nvprof ./a.out and it shows the time spent in each function. While this is good for me, I want to see a timeline of my application. I could easily have done this if I were building my project with Nsight, but unfortunately I can't do that. So how can I invoke nvprof outside Nsight in order to see the timeline of my application?
There are several ways to see the timeline:
In Nsight, click the profile button after compiling;
Use the standalone GUI profiling tool nvvp that ships with CUDA, which can be launched with the command line below if /usr/local/cuda/bin (the default CUDA installation binary directory) is in your $PATH. You can then launch your a.out from the nvvp GUI to profile it and display the timeline.
$ nvvp
Use the command-line tool nvprof with the -o option to generate a result file, which can be imported by Nsight and/or nvvp to display the timeline. The nvprof user manual provides more details.
$ nvprof -o profile.result ./a.out
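After the run finishes, the result can be viewed on a machine with the CUDA toolkit installed (a sketch; profile.result is just the name from the command above):
# Open the Visual Profiler and load profile.result via File > Import to see the timeline
nvvp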

no mpicxx when compiling examples for NVIDIA CUDA 5

I installed the driver and toolkit for CUDA 5 in 64-bit RHEL 6.3 successfully.
However, when I tried compiling the CUDA 5 examples, I got the error message:
make[1]: Leaving directory `/root/NVIDIA_CUDA-5.0_Samples/0_Simple/cppIntegration'
which: no mpicxx
How can I fix this for the CUDA 5 examples to compile?
In order to build the simpleMPI example, you need some kind of MPI installed on your system. You can get around this and build most of the samples by doing:
make -k
This will attempt to go past errors in the make process and build all targets that can be built.
If you prefer, you can delete this directory:
/root/NVIDIA_CUDA-5.0_Samples/0_Simple/simpleMPI
perhaps with the following command, as root:
rm -Rf /root/NVIDIA_CUDA-5.0_Samples/0_Simple/simpleMPI
and relaunch your make. Personally I think the make -k option is simpler.
(the message about cppIntegration is just the last target that got successfully built)
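Of course, the direct fix is to install an MPI implementation so that mpicxx is found. On RHEL 6 something like the following may work (package and module names are assumptions and can differ between systems):
# Install Open MPI and its development tools (provides mpicxx)
yum install openmpi openmpi-devel
# mpicxx usually lives outside the default PATH; load it via environment modules
module avail              # list the MPI modules available on the system
module load openmpi-x86_64
make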

breakpoints in CUDA do not work!

With a very simple hello-world program, a breakpoint is not working.
I can't quote the exact message since it's not written in English, but it's something like 'the symbols for this document are not loaded'.
There is no CUDA code yet, just a single printf in the main function.
The working environment is Windows 7 64-bit, VC++ 2008 SP1, and CUDA Toolkit 3.1 64-bit.
Please give me some explanation of this. :)
So this is just a host application (i.e. nothing to do with CUDA) doing printf that you can't debug? Have you selected "Debug" as the configuration instead of "Release"?
Are you trying to use a Visual Studio breakpoint to stop in your CUDA device code (.cu)? If that is the case, then I'm pretty sure you can't do that. NVIDIA has released Parallel Nsight, which should allow you to debug CUDA device code (.cu), though I don't have much experience with it myself.
Did you compile with the -g -G options as noted in the documentation?
NVCC, the NVIDIA CUDA compiler driver, provides a mechanism for generating the debugging information necessary for CUDA-GDB to work properly. The -g -G option pair must be passed to NVCC when an application is compiled for ease of debugging with CUDA-GDB; for example,
nvcc -g -G foo.cu -o foo
here: https://docs.nvidia.com/cuda/cuda-gdb/index.html
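For reference, on Linux a minimal cuda-gdb session might look like this (foo.cu and the kernel name myKernel are illustrative):
# Build with host (-g) and device (-G) debug information
nvcc -g -G foo.cu -o foo
# Breakpoints can then be set on a kernel by name inside cuda-gdb
cuda-gdb ./foo
(cuda-gdb) break myKernel
(cuda-gdb) run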