Trouble compiling/running CUDA code involving dynamic parallelism

I am trying to use dynamic parallelism with CUDA, but I cannot get past the compilation step.
I am working on a GPU with compute capability 3.5 and CUDA 7.5.
Depending on the switches in the compile command, I get different error messages, but following the documentation
I arrived at one line leading to a successful compilation:
nvcc -arch=compute_35 -rdc=true cudaDynamic.cu -o cudaDynamic.out -lcudadevrt
But when the program is launched, everything fails. With cuda-memcheck, every CUDA API call reports the same error message:
========= CUDA-MEMCHECK
========= Program hit cudaErrorUnknown (error 30) due to "unknown error" on CUDA API call to ...
I have also tried this line (taken from the makefile of the CUDA dynamic parallelism samples):
nvcc -ccbin g++ -I../../common/inc -m64 -dc -gencode arch=compute_35,code=compute_35 -o cudaDynamic.out -c cudaDynamic.cu
But upon execution, I get:
cudaDynamic.out: Permission denied
I would like to understand how to correctly compile CUDA code that uses dynamic parallelism, because all the other compilation lines I have tried so far have failed.

I fixed the problem by fully reinstalling CUDA.
I'm now able to compile both the CUDA samples and my own code.
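As a side note, the second command above only runs device compilation to an object file (the -dc and -c flags), which is why trying to execute its output gives "Permission denied". As a sanity check for the toolchain, a minimal dynamic parallelism program along these lines should build and run with a command like the first one (file name and kernel names here are just illustrative):
#include <cstdio>

// Child kernel: each thread reports which parent thread launched it.
__global__ void childKernel(int parentThread)
{
    printf("child thread %d launched by parent thread %d\n", threadIdx.x, parentThread);
}

// Parent kernel: every thread launches a small child grid (dynamic parallelism).
__global__ void parentKernel()
{
    childKernel<<<1, 4>>>(threadIdx.x);
    cudaDeviceSynchronize();   // wait for this thread's child grid to finish
}

int main()
{
    parentKernel<<<1, 2>>>();
    cudaError_t err = cudaDeviceSynchronize();
    printf("final status: %s\n", cudaGetErrorString(err));
    return 0;
}
nvcc -arch=sm_35 -rdc=true cudaDynamic.cu -o cudaDynamic.out -lcudadevrt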

Related

Undefined reference to 'cublasCreate_v2' in '/tmp/tmpxft_0000120b_0000000-10_my_program'

I am trying to compile code that uses the cuBLAS library with the CUDA 9.0 toolkit on an NVIDIA Tesla P100 graphics card (Ubuntu 16.04). I used the following command to compile my_program.cu:
nvcc -std=c++11 -L/usr/local/cuda-9.0/lib64 my_program.cu -o my_program.o -lcublas
But I get the following error:
nvlink error: Undefined reference to 'cublasCreate_v2' in '/tmp/tmpxft_0000120b_0000000-10_my_program'
Since I have already passed the library path in the compilation command, why do I still get this error? Please help me solve it.
It seems fairly evident that you are trying to use the cuBLAS library in device code. This is different from ordinary host usage and requires special compilation/linking steps. You need to:
compile for the correct device architecture (must be cc3.5 or higher)
use relocatable device code linking
link in the cuBLAS device library (in addition to the cuBLAS host library)
link in the CUDA device runtime library
use a CUDA toolkit prior to CUDA 10.0
The following additions to your compile command line should get you there:
nvcc -std=c++11 my_program.cu -o my_program.o -lcublas -arch=sm_60 -rdc=true -lcublas_device -lcudadevrt
The above assumes you are actually using a proper install of CUDA 9.0. The CUBLAS device library was deprecated and is now removed from newer CUDA toolkits (see here).
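For context, this nvlink error shows up when a kernel calls cuBLAS routines from device code, roughly along these lines (a sketch for illustration only, not the asker's actual program; the matrix dimension n is arbitrary):
#include <cublas_v2.h>

// Kernel using the device-side cuBLAS API (the cublas_device library,
// available up to CUDA 9.x). Linking it needs -rdc=true, -lcublas_device
// and -lcudadevrt, plus a cc3.5+ architecture.
__global__ void gemmOnDevice(const float *A, const float *B, float *C, int n)
{
    cublasHandle_t handle;
    cublasCreate(&handle);            // resolves to cublasCreate_v2, the symbol nvlink reports

    const float alpha = 1.0f;
    const float beta  = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, A, n, B, n, &beta, C, n);

    cublasDestroy(handle);
}
With the flags shown in the answer above, nvlink can resolve these device-side cuBLAS symbols.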

Segmentation fault when compiling Darknet for GPU

I want to compile the Darknet machine learning framework on my PC with GPU support. However, when I call make, I get a segmentation fault:
nvcc -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=[sm_50,compute_50] -gencode arch=compute_52,code=[sm_52,compute_52] -Iinclude/ -Isrc/ -DOPENCV `pkg-config --cflags opencv` -DGPU -I/usr/local/cuda/include/ --compiler-options "-Wall -Wno-unused-result -Wno-unknown-pragmas -Wfatal-errors -fPIC -Ofast -DOPENCV -DGPU" -c ./src/convolutional_kernels.cu -o obj/convolutional_kernels.o
Segmentation fault (core dumped)
Makefile:92: recipe for target 'obj/convolutional_kernels.o' failed
make: *** [obj/convolutional_kernels.o] Error 139
nvidia-smi gives me following information:
NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: 10.1
When I do nvcc --version I get:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Nov__3_21:07:56_CDT_2017
Cuda compilation tools, release 9.1, V9.1.85
The CUDA version 10.1 reported by nvidia-smi is not the same as version 9.1 of the CUDA compilation tools. Could this be the problem? nvcc is installed via apt install nvidia-cuda-toolkit.
Just going to post my solution here because I figured out the actual reason for this. It happens because make is running a different nvcc binary than the one Darknet actually wants. At least for me, which nvcc gave /usr/bin/nvcc, while the nvcc you actually want is located in /usr/local/cuda-11.1/bin (the version number may differ, obviously). So all you need to do is prepend (important!) that directory to your PATH variable:
export PATH=/usr/local/cuda-11.1/bin${PATH:+:${PATH}}
Append that line to your ~/.bashrc so it persists across sessions.
Source: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#post-installation-actions
I recommend you follow the link, because there are a couple more mandatory post-installation steps that I had also skipped.
I solved the problem. After installing CUDA, the actual nvcc binary is at /usr/local/cuda/bin/nvcc. Creating a symbolic link to it in /usr/bin/ fixed the issue.
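For example (assuming the default /usr/local/cuda symlink points at your toolkit install, and overwriting any apt-installed nvcc):
sudo ln -sf /usr/local/cuda/bin/nvcc /usr/bin/nvcc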
Another approach is to edit the Makefile and set the correct nvcc.
In my case, on line 24, replace
NVCC=nvcc
with
NVCC=/usr/local/cuda-11.0/bin/nvcc
Note that the CUDA version may vary.

How to force cubin file generation for a higher compute version

In the samples provided with CUDA 6.0, I'm running the following compile command with error output:
foo@foo:/usr/local/cuda-6.0/samples/0_Simple/cdpSimpleQuicksort$ nvcc --cubin -I../../common/inc cdpSimpleQuicksort.cu
nvcc warning : The 'compute_10' and 'sm_10' architectures are deprecated, and may be removed in a future release.
cdpSimpleQuicksort.cu(105): error: calling a __global__ function("cdp_simple_quicksort") from a __global__ function("cdp_simple_quicksort") is only allowed on the compute_35 architecture or above
cdpSimpleQuicksort.cu(114): error: calling a __global__ function("cdp_simple_quicksort") from a __global__ function("cdp_simple_quicksort") is only allowed on the compute_35 architecture or above
2 errors detected in the compilation of "/tmp/tmpxft_0000241a_00000000-6_cdpSimpleQuicksort.cpp1.ii".
I then altered the command to this, with a new failure:
foo@foo:/usr/local/cuda-6.0/samples/0_Simple/cdpSimpleQuicksort$ nvcc --cubin -I../../common/inc -gencode arch=compute_35,code=sm_35 cdpSimpleQuicksort.cu
cdpSimpleQuicksort.cu(105): error: kernel launch from __device__ or __global__ functions requires separate compilation mode
cdpSimpleQuicksort.cu(114): error: kernel launch from __device__ or __global__ functions requires separate compilation mode
2 errors detected in the compilation of "/tmp/tmpxft_000024f3_00000000-6_cdpSimpleQuicksort.cpp1.ii".
Does this have anything to do with the fact that the machine I'm on is only compute capability 2.1 and the build tools are blocking me? What's the resolution? I'm not finding anything in the documentation that clearly addresses this error.
I looked at this question, but a link to the documentation is simply not helping; I need to know how to modify the compile command.
Look at the makefile that comes with that cdpSimpleQuicksort project. It shows some additional switches that are needed to compile it, due to CUDA dynamic parallelism (which is essentially what the second set of errors is about). Go back and study that makefile, and see if you can figure out how to combine some of the compile commands there with --cubin.
The Reader's Digest version is that this should compile without error:
nvcc --cubin -rdc=true -I../../common/inc -arch=sm_35 cdpSimpleQuicksort.cu
Having said all that, you should be able to compile for whatever kind of target you want, but you won't be able to run CDP code on a cc2.1 device.
The CUDA Dynamic Parallelism documentation has more details.
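For reference, the construct those first two errors complain about is a kernel launching itself, which in simplified form (a sketch, not the actual sample source) looks like this and is only legal on compute_35+ with relocatable device code:
__global__ void cdp_simple_quicksort(int *data, int left, int right, int depth)
{
    if (depth >= 16 || right - left <= 1)
        return;                                   // recursion cut-off (real sort details omitted)

    int pivot = left + (right - left) / 2;        // placeholder for the real partition step
    // device-side recursive launches: this is what needs cc3.5+ and separate compilation
    cdp_simple_quicksort<<<1, 1>>>(data, left, pivot, depth + 1);
    cdp_simple_quicksort<<<1, 1>>>(data, pivot + 1, right, depth + 1);
}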

Cuda-gdb not stopping at breakpoints inside kernel

cuda-gdb was obeying all the breakpoints I set, before I added the '-arch sm_20' flag while compiling. I had to add it to avoid the error 'atomicAdd is undefined' (as pointed out here). Here is my current command to compile the code:
nvcc -g -G --maxrregcount=32 Main.cu -o SW_exe (..including header files...) -arch sm_20
and when I set a breakpoint inside the kernel, cuda-gdb stops once at the last line of the kernel, and then the program continues.
(cuda-gdb) b SW_kernel_1.cu:49
Breakpoint 1 at 0x4114a0: file ./SW_kernel_1.cu, line 49.
...
[Launch of CUDA Kernel 5 (diagonalComputation<<<(1024,1,1),(128,1,1)>>>) on Device 0]
Breakpoint 1, diagonalComputation (__cuda_0=15386, __cuda_1=128, __cuda_2=0xf00400000, __cuda_3=0xf00200000,
__cuda_4=100, __cuda_5=0xf03fa0000, __cuda_6=0xf04004000, __cuda_7=0xf040a0000, __cuda_8=0xf00200200,
__cuda_9=15258, __cuda_10=5, __cuda_11=-3, __cuda_12=8, __cuda_13=1) at ./SW_kernel_1.cu:183
183 }
(cuda-gdb) c
Continuing.
But as I said, if I remove the atomicAdd() call and the '-arch sm_20' flag (which makes my code incorrect), cuda-gdb does stop at the breakpoint I specify. Please tell me the reason for this behaviour.
I am using CUDA 5.5 on Tesla M2070 (Compute Capability = 2.0).
Thanks!
From the CUDA DEBUGGER User Manual, Section 3.3.1:
NVCC, the NVIDIA CUDA compiler driver, provides a mechanism for generating the
debugging information necessary for CUDA-GDB to work properly. The -g -G option
pair must be passed to NVCC when an application is compiled in order to debug with
CUDA-GDB; for example,
nvcc -g -G foo.cu -o foo
Using this line to compile the CUDA application foo.cu:
forces -O0 compilation, with the exception of very limited dead-code eliminations and register-spilling optimizations;
makes the compiler include debug information in the executable.
This means that, in principle, breakpoints may not be hit in kernel functions even when the code is compiled in debug mode, since the CUDA compiler can still perform some optimizations, so the disassembled code may not correspond exactly to the CUDA source lines.
When breakpoints are not hit, a workaround is to put a printf statement immediately after the variable one wants to check, as suggested by Robert Crovella at
CUDA debugging with VS - can't examine restrict pointers (Operation is not valid)
Here the OP has chosen a different workaround, i.e., compiling for a different architecture. Indeed, the optimizations the compiler performs can change from architecture to architecture.
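As an illustration of the printf workaround (kernel and variable names here are made up, not the asker's code):
#include <cstdio>

__global__ void diagonalComputation(int *scores, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    int myScore = idx * 2;             // hypothetical value one would like to inspect
    atomicAdd(&scores[0], myScore);    // the kind of call that forced '-arch sm_20' in the question

    // When the breakpoint on the line above is skipped, print the value
    // right after it is computed instead of relying on the debugger.
    printf("thread %d: myScore = %d\n", idx, myScore);
}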

Fatal error while compiling CUDA program

I'm implementing a program using dynamic parallelism. Whenever I compile the code, it throws a fatal error as follows:
ptxas fatal : Unresolved extern function 'cudaGetParameterBuffer'
Compiling as below:
nvcc -o dyn_par dyn_par.cu -arch=sm_35
How do I resolve it?
cudaGetParameterBuffer is part of the cudadevrt library, which you need to link in your compile command, and you also need to set --relocatable-device-code to true:
nvcc -o dyn_par dyn_par.cu -arch=sm_35 -lcudadevrt --relocatable-device-code true
Have a look at the CUDA Dynamic Parallelism Programming Guide from NVIDIA (page 21 describes the above) for more information.
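For context, any kernel launch written with <<< >>> inside device code is lowered by the compiler to device runtime calls such as cudaGetParameterBuffer, so even a minimal parent/child pair like this sketch (names are illustrative) triggers the same ptxas error unless it is compiled with relocatable device code and linked against cudadevrt:
__global__ void childKernel(int value)
{
    // trivial child kernel body
}

__global__ void parentKernel()
{
    // this device-side launch is what references cudaGetParameterBuffer at link time
    childKernel<<<1, 32>>>(threadIdx.x);
}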