CUDA kernels are launched with this syntax (at least in the runtime API):
mykernel<<<blocks, threads, shared_mem, stream>>>(args);
Is this implemented as a macro or is it special syntax that nvcc removes before handing host code off to gcc?
The nvcc preprocessing system eventually converts it to a sequence of CUDA runtime library calls before handing the code off to the host compiler. The exact sequence of calls may change depending on the CUDA version.
You can inspect the intermediate files using the --keep option to nvcc (--verbose may also help with understanding the compilation sequence), and you can see a trace of the API calls issued for a kernel launch using one of the profilers, e.g. nvprof --print-api-trace ...
---EDIT---
To make this answer more precise: nvcc directly modifies the host code, replacing the <<<...>>> syntax with runtime API calls, before passing it off to the host compiler (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#offline-compilation).
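As an illustration only, here is a minimal sketch of what a launch roughly lowers to. This is not the literal generated code, which varies by CUDA version and involves internal registration machinery, but on newer toolkits a launch ultimately funnels into the documented cudaLaunchKernel runtime entry point:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void mykernel(int *out, int val) {
    out[threadIdx.x] = val;
}

int main() {
    int *d_out = nullptr;
    int val = 7;
    cudaMalloc(&d_out, 32 * sizeof(int));

    // Roughly what  mykernel<<<1, 32, 0, 0>>>(d_out, val);  becomes:
    void *args[] = { &d_out, &val };          // pointers to each kernel argument
    cudaLaunchKernel((const void *)mykernel,  // kernel symbol
                     dim3(1), dim3(32),       // grid and block dimensions
                     args,
                     0,                       // dynamic shared memory bytes
                     0);                      // stream (default)

    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}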
I am using a remote machine with two GPUs to execute a Python script containing CUDA code. To find where I can improve the performance of my code, I am trying to use nvprof.
I have set in my code that I only want to use one of the two GPUs, yet when calling nvprof --profile-child-processes ./myscript.py, a process with the same ID is started on each of the GPUs.
Is there any argument I can give nvprof so that only one GPU is used for the profiling?
As you have pointed out, you can use the CUDA profilers to profile Python code simply by having the profiler run the Python interpreter on your script:
nvprof python ./myscript.py
Regarding the GPUs being used, the CUDA environment variable CUDA_VISIBLE_DEVICES can be used to restrict the CUDA runtime API to use only certain GPUs. You can try it like this:
CUDA_VISIBLE_DEVICES="0" nvprof --profile-child-processes python ./myscript.py
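As a quick sanity check (a small test of my own, not from the original answer), you can confirm the restriction from inside a process, because the runtime renumbers whatever devices are visible starting at 0:

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    // Under CUDA_VISIBLE_DEVICES="0" this prints 1 on the 2-GPU machine:
    // the runtime only enumerates the GPUs the variable exposes.
    printf("visible CUDA devices: %d\n", n);
    return 0;
}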
Also, nvprof is documented and has command-line help via nvprof --help. That help lists a --devices switch which appears to limit at least some functions to particular GPUs. You could try it with:
nvprof --devices 0 --profile-child-processes python ./myscript.py
For newer GPUs, nvprof may not be the best profiler choice. You should be able to use Nsight Systems in a similar fashion, for example via:
nsys profile --stats=true python ....
Additional "newer" profiler resources are linked here.
I am on Ubuntu 12.04 LTS and have installed CUDA 5.5. I understand that without any CUDA/GPGPU elements in the code, nvcc behaves as a C/C++ compiler, much like gcc. Is there any exception to this rule? If not, can I use nvcc as gcc for non-CUDA C/C++ code?
No, nvcc doesn't behave like a C/C++ compiler for host code. What it does is the following:
1) separate device code from host code into two separate files
2) compile the device code (with cudafe, ptxas, etc.)
3) invoke gcc for the host code
If no device code exists, nothing is done in steps 1) and 2). So nvcc is not actually a compiler itself; it is a compiler driver which invokes the right compiler for each part in the right order. To answer your question: if you use nvcc to compile host-only code, gcc still does the compiling.
One caveat: nvcc doesn't directly accept gcc's warning-suppression options (-W*).
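As an illustration, a host-only file (made-up name hello.cpp) compiles fine with nvcc because nvcc just forwards it to the host compiler, and host-compiler-specific flags can still be passed explicitly with -Xcompiler:

// hello.cpp -- no CUDA constructs; nvcc hands this straight to the host compiler
#include <cstdio>

int main() {
    printf("compiled by the host compiler, driven by nvcc\n");
    return 0;
}

nvcc hello.cpp -o hello
nvcc -Xcompiler -Wall hello.cpp -o hello

The second command forwards -Wall to gcc via nvcc's -Xcompiler pass-through option.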
I am trying to get memory traces from cuda-gdb. However, I am not able to step into the kernel code. I use the nvcc flags -g -G and -keep, but to no effect. I am able to put a breakpoint on the kernel function, but when I try to step to the next instruction, it jumps to the end of the kernel function. I have tried this on the SDK examples and observe the same behaviour. I am working with the CUDA 5 toolkit. Any suggestions?
Thanks!
This behavior is typical of a kernel launch failure. Make sure you check the return codes of your CUDA calls. Note that for debugging you may want to add an additional call to cudaDeviceSynchronize immediately after the kernel call and check its return code; this is the most reliable way to obtain the cause of an asynchronous kernel launch failure.
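A minimal sketch of that checking pattern (kernel and sizes are made up):

#include <cuda_runtime.h>
#include <cstdio>

__global__ void kernel(float *d) { d[threadIdx.x] *= 2.0f; }

int main() {
    float *d_data = nullptr;
    cudaMalloc(&d_data, 256 * sizeof(float));

    kernel<<<1, 256>>>(d_data);

    // Launch-configuration errors are reported immediately:
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch error: %s\n", cudaGetErrorString(err));

    // Asynchronous execution errors only surface after a synchronization:
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("execution error: %s\n", cudaGetErrorString(err));

    cudaFree(d_data);
    return 0;
}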
Update: code that runs outside the debugger but fails in cuda-gdb is most often caused by trying to debug on a single-GPU system from within a graphical environment. cuda-gdb cannot share a GPU with X, as this would hang the OS.
You need to exit the graphical environment (e.g. quit X) and debug from the console if your system has only one GPU.
If you have a multi-GPU system, then you should check your X configuration (xorg.conf) so that it does not use the GPU you reserve for debugging.
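For example, a Device section in xorg.conf can pin X to a specific GPU by PCI bus ID (the BusID below is made up; find yours with lspci):

Section "Device"
    Identifier "DisplayGPU"
    Driver     "nvidia"
    BusID      "PCI:1:0:0"
EndSection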
I heard that it is better to compile CUDA kernels separately from the host code. How do I do that with CMake? I am an absolute beginner with CMake.
Thanks
As far as I know, it is not possible with only one cmake invocation unless you resort to some hacks. You could write two CMakeLists.txt files, one for the CUDA code and one for the host code. In the file for the host code you can add the CUDA part as a library. After that you can write a shell script which runs the build for each CMakeLists.txt.
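For the library part, a host-side CMakeLists.txt might look roughly like this, using the old FindCUDA module that matches this era (target and file names are made up):

cmake_minimum_required(VERSION 2.8)
project(myapp)
find_package(CUDA REQUIRED)
# Device code goes into its own library target, built by nvcc...
cuda_add_library(gpu_code kernels.cu)
# ...while the host code is compiled by the host compiler and linked against it:
add_executable(myapp main.cpp)
target_link_libraries(myapp gpu_code)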
I usually do this with make. There I have two targets, one each for the CUDA code and the host code, each compiled into an object file. A third target, which depends on the other two, links the object files into an executable.
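In terms of plain commands, that flow looks roughly like this (file names are made up, and your CUDA library path may differ):

nvcc -c kernels.cu -o kernels.o    # device code target -> object file
g++ -c main.cpp -o main.o          # host code target -> object file
g++ main.o kernels.o -o app -L/usr/local/cuda/lib64 -lcudart   # link target

The final link is done by the host compiler; -lcudart pulls in the CUDA runtime library that the device object file depends on.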
I am getting this warning during compilation of C code with OpenMP directives on Linux:
warning: ignoring #pragma omp parallel
The gcc version is 4.4.
Is it only a warning I should not care about? Will the execution be in parallel? I would like a solution with some explanation.
I have passed -fopenmp via the make command, but gcc doesn't seem to pick it up there; compiling a single file directly with gcc -fopenmp works fine.
IIRC you have to pass -fopenmp to the g++ call to actually enable OpenMP. This will also link against the OpenMP runtime system.
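A minimal test program (made-up file name omp_hello.c) makes the effect of the flag visible; if make is not picking it up, check that -fopenmp is actually added to CFLAGS in your Makefile:

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void) {
    #pragma omp parallel
    {
#ifdef _OPENMP
        /* With gcc -fopenmp, one line prints per thread. */
        printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
#else
        /* Without -fopenmp, gcc warns "ignoring #pragma omp parallel"
           and this runs serially, printing a single line. */
        printf("OpenMP disabled; running serially\n");
#endif
    }
    return 0;
}

Compile and run with: gcc -fopenmp omp_hello.c -o omp_hello && ./omp_hello. Build it without -fopenmp and you get exactly the "ignoring #pragma omp parallel" warning and serial execution.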
Make sure that libgomp and its development package are installed (exact package names vary by distribution; some strip them out). libgomp is the essential OpenMP runtime and development library for gcc.
This is probably a resolved/closed issue, because indeed the most common reason for this warning is the omission of the -fopenmp flag.
However, when I came across this problem, the root cause was that the openmp environment module was not loaded. The fix was:
module load openmp