CUDA code fails on Pascal cards (GTX 1080) - cuda

I tried running an executable which uses separable compilation on a GTX 1080 today (Compute Capability 6.1 which is not directly supported by CUDA 7.5), and wasn't able to run it, as the first CUDA call fails. I have traced it down to cublas, as this simple program (which doesn't even use cublas)
#include <cuda_runtime_api.h>
#include <cstdio>
__global__ void foo()
{
}
int main(int, char**)
{
void * data = nullptr;
auto err = cudaMalloc(&data, 256);
printf("%s\n", cudaGetErrorString(err));
return 0;
}
fails (outputs "unknown error") if built using
nvcc -dc --gpu-architecture=compute_52 -m64 main.cu -o main.dc.obj
nvcc -dlink --gpu-architecture=compute_52 -m64 -lcublas_device main.dc.obj -o main.obj
link /SUBSYSTEM:CONSOLE /LIBPATH:"%CUDA_PATH%\lib\x64" main.obj main.dc.obj cudart_static.lib cudadevrt.lib cublas_device.lib
And works (outputs "no error") if built using
nvcc -dc --gpu-architecture=compute_52 -m64 main.cu -o main.dc.obj
nvcc -dlink --gpu-architecture=compute_52 -m64 main.dc.obj -o main.obj
link /SUBSYSTEM:CONSOLE /LIBPATH:"%CUDA_PATH%\lib\x64" main.obj main.dc.obj cudart_static.lib cudadevrt.lib
Even if built using the CUDA 8 release candidate, and compute_61 instead, it still fails as long as cublas_device.lib is linked.

Analysis of the simpleDevLibCublas example shows that the example is built for a set of real architectures (sm_xx), and not for virtual architectures (compute_xx), therefore the example in CUDA 7.5 does not run on newer cards. Furthermore, the same example in CUDA 8RC only includes one additional architecture, sm_60. Which is only used by the P100. However, that example does run on 6.1 cards such as the GTX 1080 as well. Support for the sm_61 architecture is not included in Cublas even in CUDA 8RC.
Therefore, the program will work if built using --gpu-architecture=sm_60 even if linking cublas_device, but will not work with --gpu-architecture=compute_60, --gpu-architecture=sm_61 or --gpu-architecture=compute_61. Or any --gpu-architecture=compute_xx for that matter.

Related

Launch kernel from another kernel in CUDA [duplicate]

I am doing dynamic parallelism programming using CUDA 5.5 and an NVDIA GeForce GTX 780 whose compute capability is 3.5. I am calling a kernel function inside a kernel function but it is giving me an error:
error : calling a __global__ function("kernel_6") from a __global__ function("kernel_5") is only allowed on the compute_35 architecture or above
What am I doing wrong?
You can do something like this
nvcc -arch=sm_35 -rdc=true simple1.cu -o simple1 -lcudadevrt
or
If you have 2 files simple1.cu and test.c then you can do something as below. This is called seperate compilation.
nvcc -arch=sm_35 -dc simple1.cu
nvcc -arch=sm_35 -dlink simple1.o -o link.o -lcudadevrt
g++ -c test.c
g++ link.o simple1.o test.o -o simple -L/usr/local/cuda/lib64/ -lcudart
The same is explained in the cuda programming guide
From Visual Studio 2010:
1) View -> Property Pages
2) Configuration Properties -> CUDA C/C++ -> Common -> Generate Relocatable Device Code -> Yes (-rdc=true)
3) Configuration Properties -> CUDA C/C++ -> Device -> Code Generation -> compute_35,sm_35
4) Configuration Properties -> Linker -> Input -> Additional Dependencies -> cudadevrt.lib
You need to let nvcc generate CC 3.5 code for your device. This can be done by adding this option to nvcc command line.
-gencode arch=compute_35,code=sm_35
You may find the CUDA samples on dynamic parallelism for more detail. They contain both command line options and project settings for all supported OS.
http://docs.nvidia.com/cuda/cuda-samples/index.html#simple-quicksort--cuda-dynamic-parallelism-

compute capability and calling a kernel from a kernel [duplicate]

I am doing dynamic parallelism programming using CUDA 5.5 and an NVDIA GeForce GTX 780 whose compute capability is 3.5. I am calling a kernel function inside a kernel function but it is giving me an error:
error : calling a __global__ function("kernel_6") from a __global__ function("kernel_5") is only allowed on the compute_35 architecture or above
What am I doing wrong?
You can do something like this
nvcc -arch=sm_35 -rdc=true simple1.cu -o simple1 -lcudadevrt
or
If you have 2 files simple1.cu and test.c then you can do something as below. This is called seperate compilation.
nvcc -arch=sm_35 -dc simple1.cu
nvcc -arch=sm_35 -dlink simple1.o -o link.o -lcudadevrt
g++ -c test.c
g++ link.o simple1.o test.o -o simple -L/usr/local/cuda/lib64/ -lcudart
The same is explained in the cuda programming guide
From Visual Studio 2010:
1) View -> Property Pages
2) Configuration Properties -> CUDA C/C++ -> Common -> Generate Relocatable Device Code -> Yes (-rdc=true)
3) Configuration Properties -> CUDA C/C++ -> Device -> Code Generation -> compute_35,sm_35
4) Configuration Properties -> Linker -> Input -> Additional Dependencies -> cudadevrt.lib
You need to let nvcc generate CC 3.5 code for your device. This can be done by adding this option to nvcc command line.
-gencode arch=compute_35,code=sm_35
You may find the CUDA samples on dynamic parallelism for more detail. They contain both command line options and project settings for all supported OS.
http://docs.nvidia.com/cuda/cuda-samples/index.html#simple-quicksort--cuda-dynamic-parallelism-

Kernel seem not to execute

I'm a beginner when it comes to CUDA programming, but this situation doesn't look complex, yet it doesn't work.
#include <cuda.h>
#include <cuda_runtime.h>
#include <iostream>
__global__ void add(int *t)
{
t[2] = t[0] + t[1];
}
int main(int argc, char **argv)
{
int sum_cpu[3], *sum_gpu;
sum_cpu[0] = 1;
sum_cpu[1] = 2;
sum_cpu[2] = 0;
cudaMalloc((void**)&sum_gpu, 3 * sizeof(int));
cudaMemcpy(sum_gpu, sum_cpu, 3 * sizeof(int), cudaMemcpyHostToDevice);
add<<<1, 1>>>(sum_gpu);
cudaMemcpy(sum_cpu, sum_gpu, 3 * sizeof(int), cudaMemcpyDeviceToHost);
std::cout << sum_cpu[2];
cudaFree(sum_gpu);
return 0;
}
I'm compiling it like this
nvcc main.cu
It compiles, but the returned value is 0. I tried printing from within the kernel and it won't print so I assume i doesn't execute. Can you explain why?
I checked your code and everything is fine. It seems to me, that you are compiling it wrong (assuming you installed the CUDA SDK properly). Maybe you are missing some flags... That's a bit complicated in the beginning I think. Just check which compute capability your GPU has.
As a best practice I am using a Makefile for each of my CUDA projects. It is very easy to use when you first correctly set up your paths. A simplified version looks like this:
NAME=base
# Compilers
NVCC = nvcc
CC = gcc
LINK = nvcc
CUDA_INCLUDE=/opt/cuda
CUDA_LIBS= -lcuda -lcudart
SDK_INCLUDE=/opt/cuda/include
# Flags
COMMONFLAGS =-O2 -m64
NVCCFLAGS =-gencode arch=compute_20,code=sm_20 -m64 -O2
CXXFLAGS =
CFLAGS =
INCLUDES = -I$(CUDA_INCLUDE)
LIBS = $(CUDA_LIBS)
ALL_CCFLAGS :=
ALL_CCFLAGS += $(NVCCFLAGS)
ALL_CCFLAGS += $(addprefix -Xcompiler ,$(COMMONFLAGS))
OBJS = cuda_base.o
# Build rules
.DEFAULT: all
all: $(OBJS)
$(LINK) -o $(NAME) $(LIBS) $(OBJS)
%.o: %.cu
$(NVCC) -c $(ALL_CCFLAGS) $(INCLUDES) $<
%.o: %.c
$(NVCC) -ccbin $(CC) -c $(ALL_CCFLAGS) $(INCLUDES) $<
%.o: %.cpp
$(NVCC) -ccbin $(CXX) -c $(ALL_CCFLAGS) $(INCLUDES) $<
clean:
rm $(OBJS) $(NAME)
Explanation
I am using Arch Linux x64
the code is stored in a file called cuda_base.cu
the path to my CUDA SDK is /opt/cuda (maybe you have a different path)
most important: Which compute capability has your card? Mine is a GTX 580 with maximum compute capability 2.0. So I have to set as an NVCC flag arch=compute_20,code=sm_20, which stands for compute capability 2.0
The Makefile needs to be stored besides cuda_base.cu. I just copy & pasted your code into this file, then typed in the shell
$ make
nvcc -c -gencode arch=compute_20,code=sm_20 -m64 -O2 -Xcompiler -O2 -Xcompiler -m64 -I/opt/cuda cuda_base.cu
nvcc -o base -lcuda -lcudart cuda_base.o
$ ./base
3
and got your result.
Me and a friend of mine created a base template for writing CUDA code. You can find it here if you like.
Hope this helps ;-)
I've had the exact same problems. I've tried the vector sum example from 'CUDA by example', Sanders & Kandrot. I typed in the code, added the vectors together, out came zeros.
CUDA doesn't print error messages to the console, and only returns error codes from the functions like CUDAMalloc and CUDAMemcpy. In my desire to get a working example, I didn't check the error codes. A basic mistake. So, when I ran the version which loads up when I start a new CUDA project in Visual Studio, and which does do error checking, bingo! an error. The error message was 'invalid device function'.
Checking out the compute capability of my card, using the program in the book or equivalent, indicated that it was...
... wait for it...
1.1
So, I changed the compile options. In Visual Studio 13, Project -> Properties -> Configuration Properties -> CUDA C/C++ -> Device -> Code Generation.
I changed the item from compute_20,sm_20 to compute_11,sm_11. This indicates that the compute capability is 1.1 rather than the assumed 2.0.
Now, the rebuilt code works as expected.
I hope that is useful.

FFTW 3.3 compile error using NVCC on Linux

every one,
I am trying to use NVCC to compile the following code that uses FFTW3.3 library:
#include <stdio.h>
#include <fftw3.h>
void main() {
fftwf_complex a;
a[0] = 1;
a[1] = -1;
printf("a = %f %f, Testing FFTW with NVCC\n", a[0], a[1]);
}
When I compile using gcc it works fine:
cc main.cpp -o main.out -lfftw3 -lm
main.out
a = 1.000000 -1.000000, Testing FFTW with CUDA
However, when I am trying to compile the same code as .cu file, using nvcc instead of gcc,
I get a long list of compile errors:
nvcc main.cu -o main.out -lfftw3 -lm
/usr/include/fftw3.h(370): error: identifier "__float128" is undefined
/usr/include/fftw3.h(370): error: identifier "__float128" is undefined
...
Removing the two libraries -lfftw3 -lm would result in an undefined symbol of fftwf_complex.
Can anyone figure out what's going on?
This is a known problem in FFTW 3.3 whereby the FFTW headers misidentify that they are being compiled with a gcc version >=4.6 which has 128bit floating point support. It has been reported to effect compilation with icc, and it looks like nvcc steered compilation has the same problem.
The recommended workaround is to upgrade to FFTW 3.3.2.

Dynamic Parallelism and CC 2.0 code in the same library

In my library I need to support devices of compute capability 2.0 and higher. For CC 3.5+ devices I’ve implemented optimized kernels which utilize Dynamic Parallelism. It seems that nvcc compiler does not support DP when anything less than “compute_35,sm_35” is specified (I'm getting compiler/linker errors). My question is what is the best way to support multiple kernel versions in such case? Having multiple DLLs and choosing between them at runtime will work but I was wondering if there is a better way.
UPDATE: I’m successfully using #if __CUDA_ARCH__ >= 350 for other things (like __ldg() etc) but it does not work in DP case as I have to link with cudadevrt.lib which produces the following error:
1>nvlink : fatal error : could not find compatible device code in C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v5.5/lib/Win32/cudadevrt.lib
I believe this issue has been addressed now in CUDA 6.
In particular, the compile problem associated with having the -lcudadevrt library specified and throwing a link error for code that is not requiring dynamic parallelism, has been eliminated/removed.
Here's my simple test:
$ cat t264.cu
#include <stdio.h>
__global__ void kernel1(){
printf("Hello from DP Kernel\n");
}
__global__ void kernel2(){
#if __CUDA_ARCH__ >= 350
kernel1<<<1,1>>>();
#else
printf("Hello from non-DP Kernel\n");
#endif
}
int main(){
kernel2<<<1,1>>>();
cudaDeviceSynchronize();
return 0;
}
$ nvcc -O3 -gencode arch=compute_20,code=sm_20 -gencode arch=compute_35,code=sm_35 -rdc=true -o t264 t264.cu -lcudadevrt
$ CUDA_VISIBLE_DEVICES="0" ./t264
Hello from non-DP Kernel
$ CUDA_VISIBLE_DEVICES="1" ./t264
Hello from DP Kernel
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Sat_Jan_25_17:33:19_PST_2014
Cuda compilation tools, release 6.0, V6.0.1
$
In my case, device 0 is a Quadro5000, a cc 2.0 device, and device 1 is a GeForce GT 640, a cc 3.5 device.