Profiling PyCUDA GPU kernel? [duplicate]

I am using a remote machine with two GPUs to execute a Python script containing CUDA code. To find where I can improve the performance of my code, I am trying to use nvprof.
I have set in my code that only one of the two GPUs should be used, yet when I call nvprof --profile-child-processes ./myscript.py, a process with the same ID is started on each of the GPUs.
Is there any argument I can give nvprof so that only one GPU is used for the profiling?

As you have pointed out, you can use the CUDA profilers to profile Python code simply by having the profiler run the Python interpreter with your script:
nvprof python ./myscript.py
Regarding the GPUs being used, the CUDA environment variable CUDA_VISIBLE_DEVICES can be used to restrict the CUDA runtime API to use only certain GPUs. You can try it like this:
CUDA_VISIBLE_DEVICES="0" nvprof --profile-child-processes python ./myscript.py
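Alternatively, you can set the variable from inside the script itself, as long as it happens before any CUDA context is created (for PyCUDA, that means before importing pycuda.autoinit or calling pycuda.driver.init()). A minimal sketch:

```python
import os

# Must be set before the CUDA driver is initialized, or it has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# import pycuda.autoinit  # from here on, only GPU 0 is visible to this process
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

Note that CUDA_VISIBLE_DEVICES renumbers the visible devices, so the selected GPU always appears as device 0 inside the process.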
nvprof is documented and also provides command-line help via nvprof --help. Looking at that help, I see a --devices switch which appears to limit at least some functions to particular GPUs. You could try it with:
nvprof --devices 0 --profile-child-processes python ./myscript.py
For newer GPUs, nvprof may not be the best profiler choice. You should be able to use Nsight Systems in a similar fashion, for example via:
nsys profile --stats=true python ....
Additional "newer" profiler resources are linked here.

Related

How can I execute a C program on Qemu riscv and observe the output?

What is the best approach to run C code in QEMU RISC-V and observe the output? I installed QEMU RISC-V following this link. What should I do now?
https://risc-v-getting-started-guide.readthedocs.io/en/latest/linux-qemu.html
You probably want to use the static user-mode version of QEMU for most applications.
Then make sure to compile for RISC-V with the -static flag, and call qemu-riscv64-static [executable].
I highly recommend this; system mode is a massive pain to deal with if you don't need it
(have fun debugging the UART).
You can use libriscv to run RISC-V programs: https://github.com/fwsGonzo/libriscv
Inside the emulator folder there are two ways to build the emulator: build.sh produces emulators that run programs with no instruction listing, while debug.sh produces a debugging variant that shows the state of registers and instructions all the way through the program.
Building QEMU from source is complete overkill.

How is the CUDA<<<...>>>() kernel launch syntax implemented

CUDA kernels are launched with this syntax (at least in the runtime API):
mykernel<<<blocks, threads, shared_mem, stream>>>(args);
Is this implemented as a macro or is it special syntax that nvcc removes before handing host code off to gcc?
The nvcc preprocessing system eventually converts it to a sequence of CUDA runtime library calls before handing the code off to the host compiler for compilation. The exact sequence of calls may change depending on the CUDA version.
You can inspect the intermediate files using the --keep option to nvcc (--verbose may also help with understanding), and you can see a trace of the API calls issued for a kernel launch using one of the profilers, e.g. nvprof --print-api-trace ...
---EDIT---
To summarize: nvcc directly modifies the host code to replace the <<<...>>> syntax with runtime API calls before passing it off to the host compiler (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#offline-compilation)
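As an illustration, for a launch like mykernel<<<blocks, threads, shared_mem, stream>>>(arg), the substituted host code is roughly equivalent to the following sketch. This is not the literal nvcc output; the generated names and the configuration-push mechanism vary by CUDA version, but in recent versions the launch ultimately goes through cudaLaunchKernel() from the runtime API:

```cuda
// Hypothetical sketch of what nvcc substitutes for
//   mykernel<<<blocks, threads, shared_mem, stream>>>(arg);
void *args[] = { &arg };                  // pointers to each kernel argument
cudaLaunchKernel((const void *)mykernel,  // device function symbol
                 blocks, threads,         // grid and block dimensions (dim3)
                 args,                    // kernel argument pointer array
                 shared_mem,              // dynamic shared memory in bytes
                 stream);                 // stream to launch on
```

This is why a host-side API trace of a kernel launch shows a cudaLaunchKernel (or, in older CUDA versions, cudaConfigureCall/cudaSetupArgument/cudaLaunch) entry rather than the triple-chevron syntax itself.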

nvprof application not found

I am trying to use NVIDIA nvprof to profile my CUDA and OpenCL programs. However, whatever benchmark I choose, the only output is ======== Error: application not found. I have tried both CUDA and OpenCL benchmarks, and recompiled them several times, but it didn't help.
My CUDA version: 4.2
NVIDIA Driver version: 334.21
Unlike AMD's sprofile, the ./ prefix is needed before the application name on Linux (assuming the binary is not on your PATH).
So you can call the profiler with this command:
nvprof ./ApplicationName

Profiling MPI+Cuda [duplicate]

This question already has answers here:
Multi-GPU profiling (Several CPUs , MPI/CUDA Hybrid)
(4 answers)
Closed 8 years ago.
I'm developing an MPI+CUDA project and I've tried to profile my app with nvvp and nvprof, but in both cases no profile is generated. The app itself runs completely fine.
nvprof mpirun -np 2 MPI_test
[...]
======== Warning: No CUDA application was profiled, exiting
I tried the simpleMPI CUDA example with the same result.
I'm using CUDA 5.0 on a GTX 580 with Open MPI 1.7.3 (the feature branch, not yet released, because I'm testing the CUDA-aware option).
Any ideas? Thank you very much.
mpirun itself is not a CUDA application, so the invocation order must be reversed: mpirun -np 2 nvprof MPI_test. You also have to make sure that each instance of nvprof (two instances in this case) writes to a different output file. Open MPI exports the OMPI_COMM_WORLD_RANK environment variable, which gives the process rank in MPI_COMM_WORLD. This can be used in another small wrapper, e.g. wrap_nvprof:
#!/bin/bash
nvprof -o profile."$OMPI_COMM_WORLD_RANK" "$@"
This should be run like mpirun -n 2 ./wrap_nvprof executable <arguments> and after it has finished there should be two output files with profile information: profile.0 for rank 0 and profile.1 for rank 1.
Edit: There is an example nvprof wrapper script that does the same in a more graceful way and that handles both Open MPI and MVAPICH2 in the nvvp documentation. A version of the script is reproduced in this answer to a question that yours is more or less a duplicate of.

nvprof on a binary file

I have a binary of my program which was generated with the nvcc compiler. I want to profile it with nvprof. I tried nvprof ./a.out and it shows the time spent in each function. While this is good for me, I want to see the timeline of my application. I could easily have done this if I were building my project with Nsight, but unfortunately I can't do that. So how can I invoke nvprof outside Nsight in order to see the timeline of my application?
Several ways that you can see the timeline:
In Nsight, click the profile button after compiling;
Use the standalone GUI profiling tool nvvp shipped with CUDA, which can be launched from the command line below if /usr/local/cuda/bin (the default CUDA binary directory) is in your $PATH. You can then launch your a.out in the nvvp GUI to profile it and display the timeline.
$ nvvp
Use the command-line tool nvprof with the -o option to generate a result file, which can be imported into Nsight and/or nvvp to display the timeline. The nvprof user manual provides more details.
$ nvprof -o profile.result ./a.out