separate compilation in CUDA - cuda

System specs: laptop with nvidia optimus support (geforce 740m, supports compute capability 2.0), ubuntu 13.10, cuda 5.0, optirun (Bumblebee) 3.2.1.
I'm trying to compile and run a simpler version of the example described here:
main.cu
#include "defines.h"
#include <cuda.h>
int main ()
{
    hello<<<1, 1>>>();
    cudaDeviceSynchronize();
}
defines.h
#include <cuda.h>
extern __global__ void hello(void);
defines.cu
#include <cstdio>
#include <cuda.h>
__global__ void hello()
{
printf("Hello!\n");
}
using:
nvcc -arch=sm_20 -dc main.cu defines.cu
nvcc -arch=sm_20 main.cu defines.cu
When I try to run the output a.out file using:
optirun ./a.out
I get no "Hello!" in console. What can be the problem?

This is not right:
nvcc -arch=sm_20 -dc main.cu defines.cu
nvcc -arch=sm_20 main.cu defines.cu
The first command performs compilation (but no linking) in separate compilation mode.
The second command performs compilation and linking in one step, but without using separate compilation mode.
Try just this:
nvcc -arch=sm_20 -rdc=true main.cu defines.cu
The relevant nvcc documentation is here
Alternatively, and following the example you linked, you could also do this:
nvcc -arch=sm_20 -dc main.cu defines.cu
nvcc -arch=sm_20 main.o defines.o
As @JackOLantern points out, you may wish to replace sm_20 in the above commands with sm_30 to match your device, but that is not the reason for the failure you are observing. Code compiled for -arch=sm_20 will run on a cc 3.0 device.
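For completeness, the same build can be spelled out as explicit compile, device-link, and host-link stages, following the separate-compilation workflow in the nvcc documentation. This is only a sketch: the intermediate file name gpuCode.o and the CUDA library path are illustrative and may differ on your system.
nvcc -arch=sm_20 -dc main.cu defines.cu
nvcc -arch=sm_20 -dlink main.o defines.o -o gpuCode.o
g++ main.o defines.o gpuCode.o -L/usr/local/cuda/lib64 -lcudart -o a.out
The -dlink step performs the device link that relocatable device code requires; the final step is then an ordinary host link against the CUDA runtime.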

Related

NVCC won't look for libraries in /usr/lib/x86_64-linux-gnu - why?

Consider the following CUDA program, in a file named foo.cu:
#include <cooperative_groups.h>
#include <stdio.h>
__global__ void my_kernel() {
    auto g = cooperative_groups::this_grid();
    g.sync();
}
int main(int, char **) {
    cudaLaunchCooperativeKernel((const void*) my_kernel, 2, 2, nullptr, 0, nullptr);
    cudaDeviceSynchronize();
}
This program needs to be compiled with -rdc=true (see this question), and needs to be explicitly linked against libcudadevrt. Ok, no problem... or is it?
$ nvcc -rdc=true -o foo -gencode arch=compute_61,code=sm_61 foo.cu -lcudadevrt
nvlink error : Undefined reference to 'cudaCGGetIntrinsicHandle' in '/tmp/tmpxft_000036ec_00000000-10_foo.o'
nvlink error : Undefined reference to 'cudaCGSynchronizeGrid' in '/tmp/tmpxft_000036ec_00000000-10_foo.o'
Only if I explicitly add the library's folder with -L/usr/lib/x86_64-linux-gnu is it willing to build my program.
This is strange, because all of the CUDA libraries on my system are in that folder. Why isn't NVCC/nvlink looking in there?
Notes:
I'm using Devuan GNU/Linux 3.0.
CUDA 10.1 is installed as a distribution package.
An x86_64 machine with a GeForce 1050 Ti card.
NVCC, or perhaps nvlink, looks for paths in an environment variable named LIBRARIES. But before doing so, the configuration file /etc/nvcc.profile is read (at least, it is on Devuan).
On Devuan 3.0, that file has a line saying:
LIBRARIES =+ $(_SPACE_) -L/usr/lib/x86_64-linux-gnu/stubs
so that's where your NVCC looks by default.
You can therefore do one of two things:
Set the environment variable outside NVCC, e.g. in your ~/.profile or ~/.bashrc file:
export LIBRARIES=-L/usr/lib/x86_64-linux-gnu/
Change that nvcc.profile line to say:
LIBRARIES =+ $(_SPACE_) -L/usr/lib/x86_64-linux-gnu -L/usr/lib/x86_64-linux-gnu/stubs
and NVCC will successfully build your binary.
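With either change in place, the original command line should link cleanly. Equivalently, the directory can be passed explicitly on each invocation, which is just the command from the question with the -L flag the asker already discovered:
nvcc -rdc=true -o foo -gencode arch=compute_61,code=sm_61 foo.cu -L/usr/lib/x86_64-linux-gnu -lcudadevrt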

CUDA code fails on Pascal cards (GTX 1080)

I tried running an executable which uses separable compilation on a GTX 1080 today (Compute Capability 6.1, which is not directly supported by CUDA 7.5), and wasn't able to run it, as the first CUDA call fails. I have traced it down to cublas, as this simple program (which doesn't even use cublas)
#include <cuda_runtime_api.h>
#include <cstdio>
__global__ void foo()
{
}
int main(int, char**)
{
    void * data = nullptr;
    auto err = cudaMalloc(&data, 256);
    printf("%s\n", cudaGetErrorString(err));
    return 0;
}
fails (outputs "unknown error") if built using
nvcc -dc --gpu-architecture=compute_52 -m64 main.cu -o main.dc.obj
nvcc -dlink --gpu-architecture=compute_52 -m64 -lcublas_device main.dc.obj -o main.obj
link /SUBSYSTEM:CONSOLE /LIBPATH:"%CUDA_PATH%\lib\x64" main.obj main.dc.obj cudart_static.lib cudadevrt.lib cublas_device.lib
And works (outputs "no error") if built using
nvcc -dc --gpu-architecture=compute_52 -m64 main.cu -o main.dc.obj
nvcc -dlink --gpu-architecture=compute_52 -m64 main.dc.obj -o main.obj
link /SUBSYSTEM:CONSOLE /LIBPATH:"%CUDA_PATH%\lib\x64" main.obj main.dc.obj cudart_static.lib cudadevrt.lib
Even if built using the CUDA 8 release candidate, and compute_61 instead, it still fails as long as cublas_device.lib is linked.
Analysis of the simpleDevLibCublas example shows that the example is built for a set of real architectures (sm_xx), not for virtual architectures (compute_xx); therefore the example in CUDA 7.5 does not run on newer cards. Furthermore, the same example in CUDA 8RC only includes one additional architecture, sm_60, which is only used by the P100. However, that example does run on 6.1 cards such as the GTX 1080 as well. Support for the sm_61 architecture is not included in cuBLAS even in CUDA 8RC.
Therefore, the program will work if built using --gpu-architecture=sm_60 even when linking cublas_device, but will not work with --gpu-architecture=compute_60, --gpu-architecture=sm_61, or --gpu-architecture=compute_61, nor with any --gpu-architecture=compute_xx for that matter.
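For illustration, here is a sketch of the variant reported to work on the GTX 1080: the three commands from the question with compute_52 replaced by sm_60, still linking cublas_device. Note this requires the CUDA 8 RC toolchain, since CUDA 7.5 does not know sm_60.
nvcc -dc --gpu-architecture=sm_60 -m64 main.cu -o main.dc.obj
nvcc -dlink --gpu-architecture=sm_60 -m64 -lcublas_device main.dc.obj -o main.obj
link /SUBSYSTEM:CONSOLE /LIBPATH:"%CUDA_PATH%\lib\x64" main.obj main.dc.obj cudart_static.lib cudadevrt.lib cublas_device.lib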

FFTW 3.3 compile error using NVCC on Linux

Hi everyone,
I am trying to use NVCC to compile the following code, which uses the FFTW 3.3 library:
#include <stdio.h>
#include <fftw3.h>
int main() {
    fftwf_complex a;
    a[0] = 1;
    a[1] = -1;
    printf("a = %f %f, Testing FFTW with NVCC\n", a[0], a[1]);
}
When I compile using gcc it works fine:
cc main.cpp -o main.out -lfftw3 -lm
main.out
a = 1.000000 -1.000000, Testing FFTW with NVCC
However, when I try to compile the same code as a .cu file, using nvcc instead of gcc,
I get a long list of compile errors:
nvcc main.cu -o main.out -lfftw3 -lm
/usr/include/fftw3.h(370): error: identifier "__float128" is undefined
/usr/include/fftw3.h(370): error: identifier "__float128" is undefined
...
Removing the two libraries -lfftw3 -lm results in an undefined symbol for fftwf_complex.
Can anyone figure out what's going on?
This is a known problem in FFTW 3.3, whereby the FFTW headers misidentify that they are being compiled with a gcc version >= 4.6, which has 128-bit floating point support. It has been reported to affect compilation with icc, and it looks like nvcc-steered compilation has the same problem.
The recommended workaround is to upgrade to FFTW 3.3.2.
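If upgrading is not an option, one common workaround (a sketch, not part of the original answer; the file and function names are hypothetical) is to keep all FFTW-facing code in a plain C++ translation unit compiled by the host compiler, so that nvcc never has to parse fftw3.h:
// fft_host.cpp - compiled by g++ only, so fftw3.h never passes through nvcc
#include <fftw3.h>
#include <cstdio>
void fftw_smoke_test() {
    fftwf_complex a;
    a[0] = 1;
    a[1] = -1;
    printf("a = %f %f, Testing FFTW with NVCC\n", a[0], a[1]);
}

// main.cu - CUDA code only; declares the host routine instead of including fftw3.h
void fftw_smoke_test();
int main() {
    fftw_smoke_test();
    return 0;
}
Build the host file with g++ and let nvcc drive the final link, for example:
g++ -c fft_host.cpp
nvcc main.cu fft_host.o -lfftw3 -lm -o main.out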

Dynamic Parallelism and CC 2.0 code in the same library

In my library I need to support devices of compute capability 2.0 and higher. For CC 3.5+ devices I've implemented optimized kernels which utilize Dynamic Parallelism. It seems that the nvcc compiler does not support DP when anything less than "compute_35,sm_35" is specified (I'm getting compiler/linker errors). My question is: what is the best way to support multiple kernel versions in such a case? Having multiple DLLs and choosing between them at runtime will work, but I was wondering if there is a better way.
UPDATE: I'm successfully using #if __CUDA_ARCH__ >= 350 for other things (like __ldg() etc.), but it does not work in the DP case, as I have to link with cudadevrt.lib, which produces the following error:
1>nvlink : fatal error : could not find compatible device code in C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v5.5/lib/Win32/cudadevrt.lib
I believe this issue has been addressed now in CUDA 6.
In particular, the link error that occurred when the -lcudadevrt library was specified for code that does not require dynamic parallelism has been eliminated.
Here's my simple test:
$ cat t264.cu
#include <stdio.h>
__global__ void kernel1(){
    printf("Hello from DP Kernel\n");
}
__global__ void kernel2(){
#if __CUDA_ARCH__ >= 350
    kernel1<<<1,1>>>();
#else
    printf("Hello from non-DP Kernel\n");
#endif
}
int main(){
    kernel2<<<1,1>>>();
    cudaDeviceSynchronize();
    return 0;
}
$ nvcc -O3 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_35,code=sm_35 -rdc=true -o t264 t264.cu -lcudadevrt
$ CUDA_VISIBLE_DEVICES="0" ./t264
Hello from non-DP Kernel
$ CUDA_VISIBLE_DEVICES="1" ./t264
Hello from DP Kernel
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Sat_Jan_25_17:33:19_PST_2014
Cuda compilation tools, release 6.0, V6.0.1
$
In my case, device 0 is a Quadro5000, a cc 2.0 device, and device 1 is a GeForce GT 640, a cc 3.5 device.
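If you also want to confirm at runtime which path a given device will take, a small host-side check with the runtime API can report the compute capability. This is just a sketch, not part of the original answer:
#include <cstdio>
#include <cuda_runtime.h>
int main(){
    cudaDeviceProp prop;
    // query the compute capability of device 0
    cudaGetDeviceProperties(&prop, 0);
    if (prop.major > 3 || (prop.major == 3 && prop.minor >= 5))
        printf("cc %d.%d: the DP branch of kernel2 will run\n", prop.major, prop.minor);
    else
        printf("cc %d.%d: the non-DP branch of kernel2 will run\n", prop.major, prop.minor);
    return 0;
}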
