Is it possible to call cuBLAS functions from a device function?

Here, Robert Crovella said that cuBLAS routines can be called from device code. Although I am using dynamic parallelism and compiling for compute capability 3.5, I cannot manage to call cuBLAS routines from a device function. I always get the error "calling a host function from a device/global function is not allowed". My code contains device functions that call cuBLAS routines such as cublasAlloc, cublasGetVector, cublasSetVector, and cublasDgemm.
My compilation and linking commands:
nvcc -arch=sm_35 -I. -I/usr/local/cuda/include -c -O3 -dc GPUutil.cu -o ./build/GPUutil.o
nvcc -arch=sm_35 -I. -I/usr/local/cuda/include -c -O3 -dc DivideParalelo.cu -o ./build/DivideParalelo.o
nvcc -arch=sm_35 -I. -I/usr/local/cuda/include -dlink ./build/io.o ./build/GPUutil.o ./build/DivideParalelo.o -lcudadevrt -o ./build/link.o
icc -Wwrite-strings ./build/GPUutil.o ./build/DivideParalelo.o ./build/link.o -lcudadevrt -L/usr/local/cuda/lib64 -L~/Intel/composer_xe_2015.0.090/mkl/lib/intel64 -L~/Intel/composer_xe_2015.0.090/mkl/../compiler/lib/intel64 -Wl,--start-group ~/Intel/composer_xe_2015.0.090/mkl/lib/intel64/libmkl_intel_lp64.a ~/Intel/composer_xe_2015.0.090/mkl/lib/intel64/libmkl_sequential.a ~/Intel/composer_xe_2015.0.090/mkl/lib/intel64/libmkl_core.a ~/Intel/composer_xe_2015.0.090/mkl/../compiler/lib/intel64/libiomp5.a -Wl,--end-group -lpthread -lm -lcublas -lcudart -o DivideParalelo

Here you can find all the details about the cuBLAS device API, such as:
Starting with release 5.0, the CUDA Toolkit provides a static cuBLAS library, cublas_device.a, that contains device routines with the same API as the regular cuBLAS library. Those routines internally use the Dynamic Parallelism feature to launch kernels and are therefore only available on devices with compute capability 3.5 or higher.
In order to use those library routines from the device, the user must include the header file "cublas_v2.h" corresponding to the new cuBLAS API and link against the static cuBLAS library cublas_device.a.
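To make the quoted requirement concrete, here is a minimal sketch of a device-side call (the kernel name is made up; this assumes a toolkit that still ships cublas_device.a, which later CUDA releases dropped):

#include <cublas_v2.h>

// Sketch: C = A * B for n-by-n column-major matrices, called from device code.
// Build with: nvcc -arch=sm_35 -rdc=true file.cu -lcublas_device -lcudadevrt
__global__ void dgemm_on_device(int n, const double *A, const double *B, double *C)
{
    cublasHandle_t handle;
    if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS)
        return;                       // device-side handle creation can fail

    const double alpha = 1.0, beta = 0.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, A, n, B, n, &beta, C, n);
    cublasDestroy(handle);
}

Applied to the commands in the question, the device-link step would gain -lcublas_device, e.g.:
nvcc -arch=sm_35 -I. -I/usr/local/cuda/include -dlink ./build/io.o ./build/GPUutil.o ./build/DivideParalelo.o -lcublas_device -lcudadevrt -o ./build/link.o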
If you still experience issues even after reading through the documentation and applying all of the steps described there, then ask for additional assistance.

Related

CUDA code fails on Pascal cards (GTX 1080)

I tried running an executable that uses separate compilation on a GTX 1080 today (compute capability 6.1, which is not directly supported by CUDA 7.5) and wasn't able to run it: the first CUDA call fails. I have traced it down to cuBLAS, as this simple program (which doesn't even use cuBLAS)
#include <cuda_runtime_api.h>
#include <cstdio>

__global__ void foo()
{
}

int main(int, char**)
{
    void *data = nullptr;
    auto err = cudaMalloc(&data, 256);
    printf("%s\n", cudaGetErrorString(err));
    return 0;
}
fails (outputs "unknown error") if built using
nvcc -dc --gpu-architecture=compute_52 -m64 main.cu -o main.dc.obj
nvcc -dlink --gpu-architecture=compute_52 -m64 -lcublas_device main.dc.obj -o main.obj
link /SUBSYSTEM:CONSOLE /LIBPATH:"%CUDA_PATH%\lib\x64" main.obj main.dc.obj cudart_static.lib cudadevrt.lib cublas_device.lib
And works (outputs "no error") if built using
nvcc -dc --gpu-architecture=compute_52 -m64 main.cu -o main.dc.obj
nvcc -dlink --gpu-architecture=compute_52 -m64 main.dc.obj -o main.obj
link /SUBSYSTEM:CONSOLE /LIBPATH:"%CUDA_PATH%\lib\x64" main.obj main.dc.obj cudart_static.lib cudadevrt.lib
Even if built using the CUDA 8 release candidate with compute_61 instead, it still fails as long as cublas_device.lib is linked.
Analysis of the simpleDevLibCUBLAS example shows that it is built for a set of real architectures (sm_xx), not for virtual architectures (compute_xx); therefore the CUDA 7.5 example does not run on newer cards. Furthermore, the same example in CUDA 8RC includes only one additional architecture, sm_60, which is used only by the P100. However, that example does run on 6.1 cards such as the GTX 1080 as well. Support for the sm_61 architecture is not included in cuBLAS even in CUDA 8RC.
Therefore, the program will work if built using --gpu-architecture=sm_60 even when linking cublas_device, but will not work with --gpu-architecture=compute_60, --gpu-architecture=sm_61, or --gpu-architecture=compute_61, or indeed any --gpu-architecture=compute_xx.
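Consistent with that analysis, a workaround sketch (reusing the file names from the commands above) is to build for the real sm_60 architecture at every step:
nvcc -dc --gpu-architecture=sm_60 -m64 main.cu -o main.dc.obj
nvcc -dlink --gpu-architecture=sm_60 -m64 -lcublas_device main.dc.obj -o main.obj
link /SUBSYSTEM:CONSOLE /LIBPATH:"%CUDA_PATH%\lib\x64" main.obj main.dc.obj cudart_static.lib cudadevrt.lib cublas_device.lib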

Launch kernel from another kernel in CUDA [duplicate]

I am doing dynamic parallelism programming using CUDA 5.5 and an NVIDIA GeForce GTX 780, whose compute capability is 3.5. I am calling a kernel function inside a kernel function, but it gives me an error:
error : calling a __global__ function("kernel_6") from a __global__ function("kernel_5") is only allowed on the compute_35 architecture or above
What am I doing wrong?
You can do something like this:
nvcc -arch=sm_35 -rdc=true simple1.cu -o simple1 -lcudadevrt
or
If you have two files, simple1.cu and test.c, then you can do something like the following. This is called separate compilation.
nvcc -arch=sm_35 -dc simple1.cu
nvcc -arch=sm_35 -dlink simple1.o -o link.o -lcudadevrt
g++ -c test.c
g++ link.o simple1.o test.o -o simple -L/usr/local/cuda/lib64/ -lcudart
The same is explained in the CUDA programming guide.
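For completeness, a minimal sketch of what such a nested launch looks like (kernel names are made up; it builds with either of the command lines above):

#include <cstdio>

__global__ void child(void)
{
    printf("child: thread %d\n", threadIdx.x);
}

__global__ void parent(void)
{
    // Device-side launch: needs compute capability >= 3.5,
    // relocatable device code (-rdc=true or -dc), and cudadevrt.
    child<<<1, 4>>>();
    cudaDeviceSynchronize();    // wait for the child grid to finish
}

int main(void)
{
    parent<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}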
From Visual Studio 2010:
1) View -> Property Pages
2) Configuration Properties -> CUDA C/C++ -> Common -> Generate Relocatable Device Code -> Yes (-rdc=true)
3) Configuration Properties -> CUDA C/C++ -> Device -> Code Generation -> compute_35,sm_35
4) Configuration Properties -> Linker -> Input -> Additional Dependencies -> cudadevrt.lib
You need to let nvcc generate CC 3.5 code for your device. This can be done by adding this option to the nvcc command line:
-gencode arch=compute_35,code=sm_35
You can refer to the CUDA samples on dynamic parallelism for more detail; they contain both command-line options and project settings for all supported operating systems.
http://docs.nvidia.com/cuda/cuda-samples/index.html#simple-quicksort--cuda-dynamic-parallelism-

Kernel seems not to execute

I'm a beginner when it comes to CUDA programming; this situation doesn't look complex, yet it doesn't work.
#include <cuda.h>
#include <cuda_runtime.h>
#include <iostream>

__global__ void add(int *t)
{
    t[2] = t[0] + t[1];
}

int main(int argc, char **argv)
{
    int sum_cpu[3], *sum_gpu;
    sum_cpu[0] = 1;
    sum_cpu[1] = 2;
    sum_cpu[2] = 0;
    cudaMalloc((void**)&sum_gpu, 3 * sizeof(int));
    cudaMemcpy(sum_gpu, sum_cpu, 3 * sizeof(int), cudaMemcpyHostToDevice);
    add<<<1, 1>>>(sum_gpu);
    cudaMemcpy(sum_cpu, sum_gpu, 3 * sizeof(int), cudaMemcpyDeviceToHost);
    std::cout << sum_cpu[2];
    cudaFree(sum_gpu);
    return 0;
}
I'm compiling it like this:
nvcc main.cu
It compiles, but the returned value is 0. I tried printing from within the kernel and it won't print, so I assume it doesn't execute. Can you explain why?
I checked your code and everything is fine. It seems to me that you are compiling it wrong (assuming you installed the CUDA SDK properly). Maybe you are missing some flags... That's a bit complicated in the beginning, I think. Just check which compute capability your GPU has.
As a best practice I am using a Makefile for each of my CUDA projects. It is very easy to use when you first correctly set up your paths. A simplified version looks like this:
NAME=base
# Compilers
NVCC = nvcc
CC = gcc
LINK = nvcc
CUDA_INCLUDE=/opt/cuda
CUDA_LIBS= -lcuda -lcudart
SDK_INCLUDE=/opt/cuda/include
# Flags
COMMONFLAGS =-O2 -m64
NVCCFLAGS =-gencode arch=compute_20,code=sm_20 -m64 -O2
CXXFLAGS =
CFLAGS =
INCLUDES = -I$(CUDA_INCLUDE)
LIBS = $(CUDA_LIBS)
ALL_CCFLAGS :=
ALL_CCFLAGS += $(NVCCFLAGS)
ALL_CCFLAGS += $(addprefix -Xcompiler ,$(COMMONFLAGS))
OBJS = cuda_base.o
# Build rules
.DEFAULT: all
all: $(OBJS)
	$(LINK) -o $(NAME) $(LIBS) $(OBJS)
%.o: %.cu
	$(NVCC) -c $(ALL_CCFLAGS) $(INCLUDES) $<
%.o: %.c
	$(NVCC) -ccbin $(CC) -c $(ALL_CCFLAGS) $(INCLUDES) $<
%.o: %.cpp
	$(NVCC) -ccbin $(CXX) -c $(ALL_CCFLAGS) $(INCLUDES) $<
clean:
	rm $(OBJS) $(NAME)
Explanation
I am using Arch Linux x64
the code is stored in a file called cuda_base.cu
the path to my CUDA SDK is /opt/cuda (maybe you have a different path)
most important: which compute capability does your card have? Mine is a GTX 580 with maximum compute capability 2.0, so I have to set the NVCC flag arch=compute_20,code=sm_20, which stands for compute capability 2.0
The Makefile needs to be stored beside cuda_base.cu. I just copied and pasted your code into this file, then typed in the shell
$ make
nvcc -c -gencode arch=compute_20,code=sm_20 -m64 -O2 -Xcompiler -O2 -Xcompiler -m64 -I/opt/cuda cuda_base.cu
nvcc -o base -lcuda -lcudart cuda_base.o
$ ./base
3
and got your result.
A friend of mine and I created a base template for writing CUDA code. You can find it here if you like.
Hope this helps ;-)
I've had the exact same problem. I tried the vector-sum example from 'CUDA by Example' (Sanders & Kandrot): I typed in the code, added the vectors together, and out came zeros.
CUDA doesn't print error messages to the console; it only returns error codes from functions like cudaMalloc and cudaMemcpy. In my desire to get a working example, I didn't check the error codes. A basic mistake. So when I ran the version that loads when you start a new CUDA project in Visual Studio, which does do error checking: bingo, an error. The error message was 'invalid device function'.
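For reference, a minimal error-checking sketch looks like this (CUDA_CHECK is a made-up helper name, not part of the CUDA API):

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: wraps a runtime call and reports any failure.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t e = (call);                                    \
        if (e != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(e));                        \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMalloc((void**)&ptr, bytes));
//   kernel<<<grid, block>>>(args);
//   CUDA_CHECK(cudaGetLastError());       // catches 'invalid device function'
//   CUDA_CHECK(cudaDeviceSynchronize());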
Checking out the compute capability of my card, using the program in the book or equivalent, indicated that it was...
... wait for it...
1.1
So I changed the compile options. In Visual Studio 2013: Project -> Properties -> Configuration Properties -> CUDA C/C++ -> Device -> Code Generation.
I changed the entry from compute_20,sm_20 to compute_11,sm_11, which tells the compiler that the compute capability is 1.1 rather than the assumed 2.0.
Now, the rebuilt code works as expected.
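On the command line, the equivalent would be roughly the following (a sketch; main.cu is a made-up file name, and this assumes a toolkit old enough to still support sm_11):
nvcc -gencode arch=compute_11,code=sm_11 main.cu -o main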
I hope that is useful.

Compile Error Changing Backend System of Thrust with CUDA 5

I installed CUDA 5 recently and found that existing code based on Thrust cannot be compiled. The error only happens if I switch the backend to OMP or TBB.
So I did an experiment using monte_carlo.cpp from the Thrust examples.
When I use the include path of CUDA 5.0, I get this error:
g++ -O2 -o monte_carlo monte_carlo.cpp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP -fopenmp -I /usr/local/cuda-5.0/include/
/tmp/ccFsJtAs.o: In function `main':
monte_carlo.cpp:(.text+0xa0): undefined reference to `float thrust::detail::backend::cuda::reduce_n<thrust::transform_iterator<..., float, thrust::use_default>, long, float, thrust::plus<...> >(thrust::transform_iterator<..., float, thrust::use_default>, long, float, thrust::plus<...>)'
But if I change to CUDA 4.1, using
g++ -O2 -o monte_carlo monte_carlo.cpp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP -fopenmp -I /usr/local/cuda-4.1/include/
there is no error.
My platform is Ubuntu 10.04 with g++ 4.4.3.
I hope someone can help me, thanks!
Edit
The OMP problem is solved by changing the order of -fopenmp, as @Robert pointed out, but I still fail to compile using TBB:
g++ -O2 -o monte_carlo monte_carlo.cpp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_TBB -ltbb -I /usr/local/cuda/include/
/tmp/ccxSmcnJ.o: In function `main':
monte_carlo.cpp:(.text+0xa0): undefined reference to `float thrust::detail::backend::cuda::reduce_n<thrust::transform_iterator<..., float, thrust::use_default>, long, float, thrust::plus<...> >(thrust::transform_iterator<..., float, thrust::use_default>, long, float, thrust::plus<...>)'
collect2: ld returned 1 exit status
But the compilation succeeds if I use
g++ -O2 -o monte_carlo monte_carlo.cpp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_TBB -ltbb -I /usr/local/cuda-4.1/include/
The OpenMP compilation failure would appear to have been caused by incorrectly specified compilation arguments. Compiling using
g++ -O2 -o monte_carlo monte_carlo.cpp -fopenmp -DTHRUST_DEVICE_BACKEND=THRUST_DEVICE_BACKEND_OMP -lgomp -I/usr/local/cuda/include
(i.e. specifying OpenMP code generation before any pre-processor directives) allowed for correct compilation using the Thrust OpenMP backend.
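As a sanity check, a minimal sketch that exercises the OpenMP backend with no GPU involved (omp_test.cpp is a made-up file name, using the THRUST_DEVICE_SYSTEM macro from the question's commands):

// omp_test.cpp: tiny Thrust program; the device backend is chosen at compile time
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#include <cstdio>

int main()
{
    // With THRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP, device_vector
    // lives in host memory and reduce runs on OpenMP threads.
    thrust::device_vector<int> v(1000, 1);
    int sum = thrust::reduce(v.begin(), v.end());
    printf("%d\n", sum);    // expect 1000
    return 0;
}

compiled, per the argument ordering above, as
g++ -O2 -fopenmp -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_OMP -I /usr/local/cuda/include -o omp_test omp_test.cpp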
The reported TBB back-end compilation error would appear to have been caused by attempting to use the TBB backend with Thrust 1.5.3, which has no TBB support.
[This answer has been assembled from question edits and comments to get the question off the unanswered list for the CUDA tag]