Hi everyone,
I am trying to use NVCC to compile the following code, which uses the FFTW 3.3 library:
#include <stdio.h>
#include <fftw3.h>
int main() {
fftwf_complex a;
a[0] = 1;
a[1] = -1;
printf("a = %f %f, Testing FFTW with NVCC\n", a[0], a[1]);
}
When I compile using gcc it works fine:
cc main.cpp -o main.out -lfftw3 -lm
main.out
a = 1.000000 -1.000000, Testing FFTW with NVCC
However, when I try to compile the same code as a .cu file, using nvcc instead of gcc,
I get a long list of compile errors:
nvcc main.cu -o main.out -lfftw3 -lm
/usr/include/fftw3.h(370): error: identifier "__float128" is undefined
/usr/include/fftw3.h(370): error: identifier "__float128" is undefined
...
Removing the two library flags, -lfftw3 and -lm, results in an undefined-symbol error for fftwf_complex.
Can anyone figure out what's going on?
This is a known problem in FFTW 3.3: the FFTW headers wrongly conclude that they are being compiled with gcc >= 4.6, which has 128-bit floating point support. It has been reported to affect compilation with icc, and it looks like nvcc-driven compilation has the same problem.
The recommended workaround is to upgrade to FFTW 3.3.2.
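To see why the header is fooled, it can help to print what the compiler claims about itself. The sketch below is only a diagnostic, not the actual text of fftw3.h: the helper name `claims_gcc_at_least` is made up for illustration, and the 4.6 threshold is simply the version mentioned above. On Linux, nvcc takes over the host gcc's version macros, so a check based on `__GNUC__` alone cannot tell the two apart, whereas `__CUDACC__` is defined only when nvcc is driving the compilation.

```cuda
#include <cstdio>

// Hypothetical helper: true when the predefined macros claim that the
// compiler is gcc at version major.minor or newer.
static bool claims_gcc_at_least(int major, int minor) {
#if defined(__GNUC__)
    return (__GNUC__ > major) ||
           (__GNUC__ == major && __GNUC_MINOR__ >= minor);
#else
    (void)major;
    (void)minor;
    return false;
#endif
}

int main() {
#if defined(__CUDACC__)
    std::printf("compiled by nvcc (__CUDACC__ is defined)\n");
#else
    std::printf("compiled by a plain host compiler\n");
#endif
    std::printf("claims gcc >= 4.6: %s\n",
                claims_gcc_at_least(4, 6) ? "yes" : "no");
    return 0;
}
```

The same file can be built once with gcc (as .cpp) and once with nvcc (as .cu) to compare the two reports.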
Consider the following CUDA program, in a file named foo.cu:
#include <cooperative_groups.h>
#include <stdio.h>
__global__ void my_kernel() {
auto g = cooperative_groups::this_grid();
g.sync();
}
int main(int, char **) {
cudaLaunchCooperativeKernel( (const void*) my_kernel, 2, 2, nullptr, 0, nullptr);
cudaDeviceSynchronize();
}
This program needs to be compiled with -rdc=true (see this question); and needs to be explicitly linked against libcudadevrt. Ok, no problem... or is it?
$ nvcc -rdc=true -o foo -gencode arch=compute_61,code=sm_61 foo.cu -lcudadevrt
nvlink error : Undefined reference to 'cudaCGGetIntrinsicHandle' in '/tmp/tmpxft_000036ec_00000000-10_foo.o'
nvlink error : Undefined reference to 'cudaCGSynchronizeGrid' in '/tmp/tmpxft_000036ec_00000000-10_foo.o'
Only if I explicitly add the library's folder with -L/usr/lib/x86_64-linux-gnu, is it willing to build my program.
This is strange, because all of the CUDA libraries on my system are in that folder. Why isn't NVCC/nvlink looking in there?
Notes:
I'm using Devuan GNU/Linux 3.0.
CUDA 10.1 is installed as a distribution package.
An x86_64 machine with a GeForce 1050 Ti card.
NVCC, or perhaps nvlink, looks for paths in an environment variable named LIBRARIES. But - before doing so, the shell script /etc/nvcc.profile is executed (at least, it is on Devuan).
On Devuan 3.0, that file has a line saying:
LIBRARIES =+ $(_SPACE_) -L/usr/lib/x86_64-linux-gnu/stubs
so that's where your NVCC looks by default.
You can therefore do one of two things:
Set the environment variable outside NVCC, e.g. in your ~/.profile or ~/.bashrc file:
export LIBRARIES=-L/usr/lib/x86_64-linux-gnu/
Change that nvcc.profile line to say:
LIBRARIES =+ $(_SPACE_) -L/usr/lib/x86_64-linux-gnu -L/usr/lib/x86_64-linux-gnu/stubs
and NVCC will successfully build your binary.
I tried running an executable which uses separable compilation on a GTX 1080 today (Compute Capability 6.1 which is not directly supported by CUDA 7.5), and wasn't able to run it, as the first CUDA call fails. I have traced it down to cublas, as this simple program (which doesn't even use cublas)
#include <cuda_runtime_api.h>
#include <cstdio>
__global__ void foo()
{
}
int main(int, char**)
{
void * data = nullptr;
auto err = cudaMalloc(&data, 256);
printf("%s\n", cudaGetErrorString(err));
return 0;
}
fails (outputs "unknown error") if built using
nvcc -dc --gpu-architecture=compute_52 -m64 main.cu -o main.dc.obj
nvcc -dlink --gpu-architecture=compute_52 -m64 -lcublas_device main.dc.obj -o main.obj
link /SUBSYSTEM:CONSOLE /LIBPATH:"%CUDA_PATH%\lib\x64" main.obj main.dc.obj cudart_static.lib cudadevrt.lib cublas_device.lib
And works (outputs "no error") if built using
nvcc -dc --gpu-architecture=compute_52 -m64 main.cu -o main.dc.obj
nvcc -dlink --gpu-architecture=compute_52 -m64 main.dc.obj -o main.obj
link /SUBSYSTEM:CONSOLE /LIBPATH:"%CUDA_PATH%\lib\x64" main.obj main.dc.obj cudart_static.lib cudadevrt.lib
Even if built using the CUDA 8 release candidate, and compute_61 instead, it still fails as long as cublas_device.lib is linked.
Analysis of the simpleDevLibCublas example shows that it is built for a set of real architectures (sm_xx) and not for virtual architectures (compute_xx), so the example in CUDA 7.5 does not run on newer cards. Furthermore, the same example in CUDA 8RC includes only one additional architecture, sm_60, which is used only by the P100. However, that example does run on 6.1 cards such as the GTX 1080 as well. Support for the sm_61 architecture is not included in cuBLAS even in CUDA 8RC.
Therefore, the program will work if built using --gpu-architecture=sm_60 even when linking cublas_device, but will not work with --gpu-architecture=compute_60, --gpu-architecture=sm_61, --gpu-architecture=compute_61, or any other --gpu-architecture=compute_xx.
I'm a beginner when it comes to CUDA programming, but this situation doesn't look complex, yet it doesn't work.
#include <cuda.h>
#include <cuda_runtime.h>
#include <iostream>
__global__ void add(int *t)
{
t[2] = t[0] + t[1];
}
int main(int argc, char **argv)
{
int sum_cpu[3], *sum_gpu;
sum_cpu[0] = 1;
sum_cpu[1] = 2;
sum_cpu[2] = 0;
cudaMalloc((void**)&sum_gpu, 3 * sizeof(int));
cudaMemcpy(sum_gpu, sum_cpu, 3 * sizeof(int), cudaMemcpyHostToDevice);
add<<<1, 1>>>(sum_gpu);
cudaMemcpy(sum_cpu, sum_gpu, 3 * sizeof(int), cudaMemcpyDeviceToHost);
std::cout << sum_cpu[2];
cudaFree(sum_gpu);
return 0;
}
I'm compiling it like this
nvcc main.cu
It compiles, but the returned value is 0. I tried printing from within the kernel and it won't print, so I assume it doesn't execute. Can you explain why?
I checked your code and everything is fine. It seems to me that you are compiling it incorrectly (assuming you installed the CUDA SDK properly). Maybe you are missing some flags; that part is a bit complicated in the beginning. Just check which compute capability your GPU has.
As a best practice I am using a Makefile for each of my CUDA projects. It is very easy to use when you first correctly set up your paths. A simplified version looks like this:
NAME=base
# Compilers
NVCC = nvcc
CC = gcc
LINK = nvcc
CUDA_INCLUDE=/opt/cuda
CUDA_LIBS= -lcuda -lcudart
SDK_INCLUDE=/opt/cuda/include
# Flags
COMMONFLAGS =-O2 -m64
NVCCFLAGS =-gencode arch=compute_20,code=sm_20 -m64 -O2
CXXFLAGS =
CFLAGS =
INCLUDES = -I$(CUDA_INCLUDE)
LIBS = $(CUDA_LIBS)
ALL_CCFLAGS :=
ALL_CCFLAGS += $(NVCCFLAGS)
ALL_CCFLAGS += $(addprefix -Xcompiler ,$(COMMONFLAGS))
OBJS = cuda_base.o
# Build rules
.DEFAULT: all
all: $(OBJS)
	$(LINK) -o $(NAME) $(LIBS) $(OBJS)
%.o: %.cu
	$(NVCC) -c $(ALL_CCFLAGS) $(INCLUDES) $<
%.o: %.c
	$(NVCC) -ccbin $(CC) -c $(ALL_CCFLAGS) $(INCLUDES) $<
%.o: %.cpp
	$(NVCC) -ccbin $(CXX) -c $(ALL_CCFLAGS) $(INCLUDES) $<
clean:
	rm $(OBJS) $(NAME)
Explanation
I am using Arch Linux x64
the code is stored in a file called cuda_base.cu
the path to my CUDA SDK is /opt/cuda (maybe you have a different path)
most important: which compute capability does your card have? Mine is a GTX 580 with a maximum compute capability of 2.0, so I have to pass arch=compute_20,code=sm_20 as an NVCC flag, which stands for compute capability 2.0
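If you are not sure which compute capability your card has, the runtime can tell you. Here is a minimal sketch, assuming the CUDA runtime is installed and at least one device is visible; the helper `gencode_flag` is a made-up name that only formats the flag in the style used in the Makefile above.

```cuda
#include <cstdio>
#include <string>
#include <cuda_runtime.h>

// Hypothetical helper: build the -gencode option for one compute capability,
// e.g. (2, 0) gives "-gencode arch=compute_20,code=sm_20".
static std::string gencode_flag(int major, int minor) {
    char buf[64];
    std::snprintf(buf, sizeof buf, "-gencode arch=compute_%d%d,code=sm_%d%d",
                  major, minor, major, minor);
    return std::string(buf);
}

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaGetDeviceProperties: %s\n",
                         cudaGetErrorString(err));
            return 1;
        }
        // Print the device and the flag that matches its compute capability.
        std::printf("device %d: %s, compute capability %d.%d\n  %s\n",
                    dev, prop.name, prop.major, prop.minor,
                    gencode_flag(prop.major, prop.minor).c_str());
    }
    return 0;
}
```

The printed flag can then be pasted into NVCCFLAGS, provided your toolkit version still supports that architecture.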
The Makefile needs to be stored beside cuda_base.cu. I just copied and pasted your code into this file, then typed in the shell
$ make
nvcc -c -gencode arch=compute_20,code=sm_20 -m64 -O2 -Xcompiler -O2 -Xcompiler -m64 -I/opt/cuda cuda_base.cu
nvcc -o base -lcuda -lcudart cuda_base.o
$ ./base
3
and got your result.
A friend of mine and I created a base template for writing CUDA code. You can find it here if you like.
Hope this helps ;-)
I've had the exact same problems. I've tried the vector sum example from 'CUDA by example', Sanders & Kandrot. I typed in the code, added the vectors together, out came zeros.
CUDA doesn't print error messages to the console; it only returns error codes from functions like cudaMalloc and cudaMemcpy. In my desire to get a working example, I didn't check the error codes. A basic mistake. So, when I ran the version which loads up when I start a new CUDA project in Visual Studio, and which does do error checking, bingo! An error. The error message was 'invalid device function'.
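For what it's worth, here is roughly what the question's program looks like with the checks added. This is a sketch, not the Visual Studio template code; `cuda_ok` is a made-up helper name. The kernel launch itself returns nothing, so the launch status is picked up with cudaGetLastError(), and errors during execution show up at the next synchronizing call.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: report a failed CUDA call and return false.
static bool cuda_ok(cudaError_t err, const char *what) {
    if (err == cudaSuccess) return true;
    std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    return false;
}

__global__ void add(int *t) {
    t[2] = t[0] + t[1];
}

int main() {
    int host[3] = {1, 2, 0};
    int *dev = nullptr;

    if (!cuda_ok(cudaMalloc((void **)&dev, sizeof host), "cudaMalloc")) return 1;
    if (!cuda_ok(cudaMemcpy(dev, host, sizeof host, cudaMemcpyHostToDevice),
                 "cudaMemcpy to device")) return 1;

    add<<<1, 1>>>(dev);
    // Launch failures such as 'invalid device function' are reported here.
    if (!cuda_ok(cudaGetLastError(), "kernel launch")) return 1;
    // Failures while the kernel runs are reported by the next sync point.
    if (!cuda_ok(cudaDeviceSynchronize(), "kernel execution")) return 1;

    if (!cuda_ok(cudaMemcpy(host, dev, sizeof host, cudaMemcpyDeviceToHost),
                 "cudaMemcpy to host")) return 1;

    std::printf("%d\n", host[2]);
    cudaFree(dev);
    return 0;
}
```

With a mismatched architecture flag, a program like this should stop at the "kernel launch" check and print the error string instead of silently printing 0.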
Checking out the compute capability of my card, using the program in the book or equivalent, indicated that it was...
... wait for it...
1.1
So, I changed the compile options. In Visual Studio 13, Project -> Properties -> Configuration Properties -> CUDA C/C++ -> Device -> Code Generation.
I changed the item from compute_20,sm_20 to compute_11,sm_11. This indicates that the compute capability is 1.1 rather than the assumed 2.0.
Now, the rebuilt code works as expected.
I hope that is useful.
In my library I need to support devices of compute capability 2.0 and higher. For CC 3.5+ devices I’ve implemented optimized kernels which utilize Dynamic Parallelism. It seems that the nvcc compiler does not support DP when anything less than “compute_35,sm_35” is specified (I'm getting compiler/linker errors). My question is: what is the best way to support multiple kernel versions in such a case? Having multiple DLLs and choosing between them at runtime will work, but I was wondering if there is a better way.
UPDATE: I’m successfully using #if __CUDA_ARCH__ >= 350 for other things (like __ldg() etc) but it does not work in DP case as I have to link with cudadevrt.lib which produces the following error:
1>nvlink : fatal error : could not find compatible device code in C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v5.5/lib/Win32/cudadevrt.lib
I believe this issue has been addressed now in CUDA 6.
In particular, specifying -lcudadevrt no longer causes a link error for code that does not require dynamic parallelism.
Here's my simple test:
$ cat t264.cu
#include <stdio.h>
__global__ void kernel1(){
printf("Hello from DP Kernel\n");
}
__global__ void kernel2(){
#if __CUDA_ARCH__ >= 350
kernel1<<<1,1>>>();
#else
printf("Hello from non-DP Kernel\n");
#endif
}
int main(){
kernel2<<<1,1>>>();
cudaDeviceSynchronize();
return 0;
}
$ nvcc -O3 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_35,code=sm_35 -rdc=true -o t264 t264.cu -lcudadevrt
$ CUDA_VISIBLE_DEVICES="0" ./t264
Hello from non-DP Kernel
$ CUDA_VISIBLE_DEVICES="1" ./t264
Hello from DP Kernel
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Sat_Jan_25_17:33:19_PST_2014
Cuda compilation tools, release 6.0, V6.0.1
$
In my case, device 0 is a Quadro 5000, a cc 2.0 device, and device 1 is a GeForce GT 640, a cc 3.5 device.
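If you want to know on the host which branch a given device will take, you can mirror the `__CUDA_ARCH__ >= 350` test using the device properties. This is only a sketch: `takes_dp_branch` is a made-up helper, and it assumes the binary actually contains code built for that device's architecture (as with the two -gencode options above).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper mirroring the device-side test __CUDA_ARCH__ >= 350.
// __CUDA_ARCH__ is major * 100 + minor * 10, so cc 3.5 corresponds to 350.
static bool takes_dp_branch(int major, int minor) {
    return major * 100 + minor * 10 >= 350;
}

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
        // Report which side of the #if the device-side code would compile to.
        std::printf("device %d (%s, cc %d.%d): %s branch\n",
                    dev, prop.name, prop.major, prop.minor,
                    takes_dp_branch(prop.major, prop.minor) ? "DP" : "non-DP");
    }
    return 0;
}
```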
I'm trying my hand at using the relocatable-device-code flag. I have a large project that would be easier to maintain with small blocks of code.
I was able to get the project to compile. When trying to run it, I get a hard crash. When using the debugger:
(gdb) where
#0 0x0000000000000001 in ?? ()
#1 0x00007fffffffe39c in ?? ()
#2 0x0000000000000000 in ?? ()
I've never seen a stack trace like that! I then reduced the amount of code until I was down to a single file: main.cu contains only
#include <iostream>
int main(void) {
std::cout << "hello, world" << std::endl;
return 0;
}
This still fails. I'm using the following flags to compile my main.cu file.
nvcc -shared -rdc=true -arch=sm_20 -Xcompiler -fPIC -g -G
Do these make sense? Why the segmentation fault for such a simple program?
Remove the -shared switch. That option is not applicable when you are trying to generate an executable.
From the documentation:
Generate a shared library during linking. Note: when other linker options are required for controlling dll generation, use option -Xlinker.