Dynamic Parallelism and CC 2.0 code in the same library - cuda

In my library I need to support devices of compute capability 2.0 and higher. For CC 3.5+ devices I’ve implemented optimized kernels which utilize Dynamic Parallelism. It seems that the nvcc compiler does not support DP when anything lower than “compute_35,sm_35” is specified (I'm getting compiler/linker errors). My question is: what is the best way to support multiple kernel versions in such a case? Having multiple DLLs and choosing between them at runtime would work, but I was wondering if there is a better way.
UPDATE: I’m successfully using #if __CUDA_ARCH__ >= 350 for other things (like __ldg() etc.), but it does not work in the DP case because I have to link against cudadevrt.lib, which produces the following error:
1>nvlink : fatal error : could not find compatible device code in C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v5.5/lib/Win32/cudadevrt.lib

I believe this issue has been addressed now in CUDA 6.
In particular, the link error that used to occur when the -lcudadevrt library was specified for code that does not require dynamic parallelism has been eliminated.
Here's my simple test:
$ cat t264.cu
#include <stdio.h>
__global__ void kernel1(){
  printf("Hello from DP Kernel\n");
}

__global__ void kernel2(){
#if __CUDA_ARCH__ >= 350
  kernel1<<<1,1>>>();
#else
  printf("Hello from non-DP Kernel\n");
#endif
}

int main(){
  kernel2<<<1,1>>>();
  cudaDeviceSynchronize();
  return 0;
}
$ nvcc -O3 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_35,code=sm_35 -rdc=true -o t264 t264.cu -lcudadevrt
$ CUDA_VISIBLE_DEVICES="0" ./t264
Hello from non-DP Kernel
$ CUDA_VISIBLE_DEVICES="1" ./t264
Hello from DP Kernel
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Sat_Jan_25_17:33:19_PST_2014
Cuda compilation tools, release 6.0, V6.0.1
$
In my case, device 0 is a Quadro5000, a cc 2.0 device, and device 1 is a GeForce GT 640, a cc 3.5 device.
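As a side note, the question also asked about choosing between kernel versions at runtime (e.g. via multiple DLLs). If explicit host-side dispatch is preferred over the __CUDA_ARCH__ approach, the device's compute capability can be queried with cudaGetDeviceProperties(). Here is a minimal sketch; the kernel names are placeholders and are not part of the test above:
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernels standing in for the library's DP-optimized and fallback paths.
__global__ void optimizedDPKernel() { printf("DP path\n"); }
__global__ void fallbackKernel()    { printf("fallback path\n"); }

// True if the device supports dynamic parallelism (compute capability 3.5 or higher).
static bool supportsDynamicParallelism(int dev)
{
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess)
    return false;
  return prop.major > 3 || (prop.major == 3 && prop.minor >= 5);
}

int main()
{
  int dev = 0;
  cudaSetDevice(dev);
  if (supportsDynamicParallelism(dev))
    optimizedDPKernel<<<1,1>>>();
  else
    fallbackKernel<<<1,1>>>();
  cudaDeviceSynchronize();
  return 0;
}
With the __CUDA_ARCH__ approach shown above, this dispatch is not required: the fat binary already carries both variants and the runtime selects the matching SASS for the device.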

Related

NVCC won't look for libraries in /usr/lib/x86_64-linux-gnu - why?

Consider the following CUDA program, in a file named foo.cu:
#include <cooperative_groups.h>
#include <stdio.h>
__global__ void my_kernel() {
  auto g = cooperative_groups::this_grid();
  g.sync();
}

int main(int, char **) {
  cudaLaunchCooperativeKernel((const void*) my_kernel, 2, 2, nullptr, 0, nullptr);
  cudaDeviceSynchronize();
}
This program needs to be compiled with -rdc=true (see this question); and needs to be explicitly linked against libcudadevrt. Ok, no problem... or is it?
$ nvcc -rdc=true -o foo -gencode arch=compute_61,code=sm_61 foo.cu -lcudadevrt
nvlink error : Undefined reference to 'cudaCGGetIntrinsicHandle' in '/tmp/tmpxft_000036ec_00000000-10_foo.o'
nvlink error : Undefined reference to 'cudaCGSynchronizeGrid' in '/tmp/tmpxft_000036ec_00000000-10_foo.o'
Only if I explicitly add the library's folder with -L/usr/lib/x86_64-linux-gnu, is it willing to build my program.
This is strange, because all of the CUDA libraries on my system are in that folder. Why isn't NVCC/nvlink looking in there?
Notes:
I'm using Devuan GNU/Linux 3.0.
CUDA 10.1 is installed as a distribution package.
An x86_64 machine with a GeForce 1050 Ti card.
NVCC (or perhaps nvlink) looks for library paths in an environment variable named LIBRARIES. But before doing so, the settings in /etc/nvcc.profile are applied (at least, they are on Devuan).
On Devuan 3.0, that file has a line saying:
LIBRARIES =+ $(_SPACE_) -L/usr/lib/x86_64-linux-gnu/stubs
so that is where your NVCC looks by default.
You can therefore do one of two things:
1) Set the environment variable outside NVCC, e.g. in your ~/.profile or ~/.bashrc file:
export LIBRARIES=-L/usr/lib/x86_64-linux-gnu/
2) Change that nvcc.profile line to say:
LIBRARIES =+ $(_SPACE_) -L/usr/lib/x86_64-linux-gnu -L/usr/lib/x86_64-linux-gnu/stubs
Either way, NVCC will then successfully build your binary.
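Once it links, it may also be worth verifying at runtime that the device actually supports cooperative launch before calling cudaLaunchCooperativeKernel. A minimal sketch using the standard runtime attribute query (not part of the original answer):
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
  int coop = 0;
  // Reports whether device 0 can run kernels launched via cudaLaunchCooperativeKernel.
  cudaDeviceGetAttribute(&coop, cudaDevAttrCooperativeLaunch, 0);
  printf("cooperative launch supported: %s\n", coop ? "yes" : "no");
  return 0;
}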

CUDA Kernel does not launch, CMAKE Visual Studio 2015 project

I have a relatively simple CUDA kernel and I immediately call the kernel in the main method of my program in the following way:
__global__ void block() {
  for (int i = 0; i < 20; i++) {
    printf("a");
  }
}

int main(int argc, char** argv) {
  block<<<1, 1>>>();
  cudaError_t cudaerr = cudaDeviceSynchronize();
  printf("Kernel executed!\n");
  if (cudaerr != cudaSuccess)
    printf("kernel launch failed with error \"%s\".\n",
           cudaGetErrorString(cudaerr));
}
This program is compiled and launched using Visual Studio 2015, and the project being executed has been generated with CMAKE using the following CMakeLists.txt file:
project(Comparison)
cmake_minimum_required(VERSION 2.6)
find_package(CUDA REQUIRED)
set(
  CUDA_NVCC_FLAGS
  ${CUDA_NVCC_FLAGS};
  -arch=compute_30 -code=sm_30 -g -G
)
cuda_add_executable(Comparison kernel.cu)
I would expect this program to print 20 a's to the console and then end by printing "Kernel executed!". However, the a's are never printed and the line "Kernel executed!" shows up immediately, even if I replace the for loop with a while(true) loop.
Even when I run the code with the Nsight debugger attached and a breakpoint set in the kernel's for loop, nothing happens, which leads me to believe that the kernel is never actually launched. Does anyone know how to make this kernel behave as expected?
The kernel was not running when compiled with the given CMakeLists.txt file because of these flags:
-arch=compute_30 -code=sm_30
combined with the GPU that was being used (GTX 970, a cc 5.2 GPU).
Those flags specify the generation of cc 3.0 SASS code only, and such code is not compatible with a cc 5.2 device. The fix would be to modify the flags to something like:
-arch=sm_30
or
-arch=sm_52
or
-arch=compute_52 -code=sm_52
I would recommend the first or second approach, as it will include PTX support for future devices.
The kernel error was not evident because the error checking after the kernel launch was incomplete. Refer to the canonical question/answer on CUDA error checking.
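For reference, a more complete check also inspects cudaGetLastError() immediately after the launch, which is what surfaces errors such as "no kernel image is available for execution on the device" in this situation. A sketch (not the asker's original code):
#include <cstdio>
#include <cuda_runtime.h>

__global__ void block() {
  for (int i = 0; i < 20; i++) {
    printf("a");
  }
}

int main() {
  block<<<1, 1>>>();
  // Catches launch/configuration errors, e.g. no compatible kernel image in the binary.
  cudaError_t launchErr = cudaGetLastError();
  if (launchErr != cudaSuccess)
    printf("kernel launch failed with error \"%s\".\n", cudaGetErrorString(launchErr));
  // Catches errors that occur while the kernel is executing.
  cudaError_t syncErr = cudaDeviceSynchronize();
  if (syncErr != cudaSuccess)
    printf("kernel execution failed with error \"%s\".\n", cudaGetErrorString(syncErr));
  printf("Kernel executed!\n");
  return 0;
}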

CUDA code fails on Pascal cards (GTX 1080)

I tried running an executable which uses separable compilation on a GTX 1080 today (Compute Capability 6.1 which is not directly supported by CUDA 7.5), and wasn't able to run it, as the first CUDA call fails. I have traced it down to cublas, as this simple program (which doesn't even use cublas)
#include <cuda_runtime_api.h>
#include <cstdio>

__global__ void foo()
{
}

int main(int, char**)
{
  void* data = nullptr;
  auto err = cudaMalloc(&data, 256);
  printf("%s\n", cudaGetErrorString(err));
  return 0;
}
fails (outputs "unknown error") if built using
nvcc -dc --gpu-architecture=compute_52 -m64 main.cu -o main.dc.obj
nvcc -dlink --gpu-architecture=compute_52 -m64 -lcublas_device main.dc.obj -o main.obj
link /SUBSYSTEM:CONSOLE /LIBPATH:"%CUDA_PATH%\lib\x64" main.obj main.dc.obj cudart_static.lib cudadevrt.lib cublas_device.lib
And works (outputs "no error") if built using
nvcc -dc --gpu-architecture=compute_52 -m64 main.cu -o main.dc.obj
nvcc -dlink --gpu-architecture=compute_52 -m64 main.dc.obj -o main.obj
link /SUBSYSTEM:CONSOLE /LIBPATH:"%CUDA_PATH%\lib\x64" main.obj main.dc.obj cudart_static.lib cudadevrt.lib
Even if built using the CUDA 8 release candidate, and compute_61 instead, it still fails as long as cublas_device.lib is linked.
Analysis of the simpleDevLibCublas example shows that the example is built for a set of real architectures (sm_xx), not for virtual architectures (compute_xx); therefore the example in CUDA 7.5 does not run on newer cards. Furthermore, the same example in CUDA 8RC includes only one additional architecture, sm_60, which is only used by the P100. However, that example does run on cc 6.1 cards such as the GTX 1080 as well, because sm_60 SASS is binary-compatible with cc 6.1 devices. Support for the sm_61 architecture is not included in cublas_device even in CUDA 8RC.
Therefore, the program will work if built using --gpu-architecture=sm_60 even when linking cublas_device, but it will not work with --gpu-architecture=compute_60, --gpu-architecture=sm_61, --gpu-architecture=compute_61, or any --gpu-architecture=compute_xx for that matter.

Launch kernel from another kernel in CUDA [duplicate]

I am doing dynamic parallelism programming using CUDA 5.5 and an NVIDIA GeForce GTX 780, whose compute capability is 3.5. I am calling a kernel function inside a kernel function, but it gives me the following error:
error : calling a __global__ function("kernel_6") from a __global__ function("kernel_5") is only allowed on the compute_35 architecture or above
What am I doing wrong?
You can do something like this:
nvcc -arch=sm_35 -rdc=true simple1.cu -o simple1 -lcudadevrt
or
If you have two files, simple1.cu and test.c, then you can do something like the following. This is called separate compilation.
nvcc -arch=sm_35 -dc simple1.cu
nvcc -arch=sm_35 -dlink simple1.o -o link.o -lcudadevrt
g++ -c test.c
g++ link.o simple1.o test.o -o simple -L/usr/local/cuda/lib64/ -lcudart
The same is explained in the CUDA programming guide.
From Visual Studio 2010:
1) View -> Property Pages
2) Configuration Properties -> CUDA C/C++ -> Common -> Generate Relocatable Device Code -> Yes (-rdc=true)
3) Configuration Properties -> CUDA C/C++ -> Device -> Code Generation -> compute_35,sm_35
4) Configuration Properties -> Linker -> Input -> Additional Dependencies -> cudadevrt.lib
You need to let nvcc generate CC 3.5 code for your device. This can be done by adding this option to the nvcc command line:
-gencode arch=compute_35,code=sm_35
You may refer to the CUDA samples on dynamic parallelism for more detail. They contain both command line options and project settings for all supported OSes.
http://docs.nvidia.com/cuda/cuda-samples/index.html#simple-quicksort--cuda-dynamic-parallelism-
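For completeness, a minimal simple1.cu exercising the parent-to-child launch from the error message might look like this (a sketch; the kernel names mirror the error message, the bodies are placeholders):
#include <cstdio>

// Child kernel, launched from device code (requires cc 3.5+ and -rdc=true).
__global__ void kernel_6() {
  printf("hello from the child kernel\n");
}

// Parent kernel, launching the child directly from the device.
__global__ void kernel_5() {
  kernel_6<<<1, 1>>>();
}

int main() {
  kernel_5<<<1, 1>>>();
  cudaDeviceSynchronize();
  return 0;
}
Built with the nvcc -arch=sm_35 -rdc=true ... -lcudadevrt command shown above, the device-side launch compiles and links without the reported error.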
