Call cublas in a kernel

I want to use Zgemv in parallel.
__global__ void S_Cphir(cuDoubleComplex *S, cuDoubleComplex *A, cuDoubleComplex *B, int n, int l)
{
    ....
    cublasZgemv(handle, CUBLAS_OP_N, n, n, &alpha, S+i*n*n, n, A+n*i, 1, &beta, B+i*n, 1);
}
void S_Cphir_(cuDoubleComplex *S, cuDoubleComplex *A, cuDoubleComplex *B, int n, int l)
{
    dim3 grid = dim3(1,1,1);
    dim3 block = dim3(32,1,1);
    S_Cphir<<<grid,block>>>(S, A, B, n, l);
}
My compile commands are:
nvcc -c -arch=compute_30 -code=sm_35 time_propagation_cublas.cu --relocatable-device-code true
nvcc -o ./main.v2 time_propagation_cublas.o -lcublas
The first command works, but the second one fails with:
In function`__sti____cudaRegisterAll_58_tmpxft_000032b7_00000000_6_time_propagation_cublas_cpp1_ii_0d699356()';tmpxft_000032b7_00000000-3_time_propagation_cublas.cudafe1.cpp:(.text+0x17a4):
undefined reference to `__cudaRegisterLinkedBinary_58_tmpxft_000032b7_00000000_6_time_propagation_cublas_cpp1_ii_0d699356'
collect2: ld returned 1 exit status
I searched for "cudaRegisterLinkedBinary" but found nothing.
I know that nvcc supports calling cuBLAS from a kernel.

Use the CUBLAS Device Library sample code as your reference. On a standard CUDA 5.5 install, you'll find it at:
/usr/local/cuda/samples/7_CUDALibraries/simpleDevLibCUBLAS
Referring to the Makefile in that directory, your compile commands should be like this:
nvcc -arch=sm_35 -rdc=true -o main.v2 time_propagation_cublas.cu -lcublas -lcublas_device -lcudadevrt

Related

How to call a Thrust function in a stream from a kernel?

I want to make thrust::scatter asynchronous by calling it in a device kernel (I could also do it by calling it in another host thread). thrust::cuda::par.on(stream) is a host function that cannot be called from a device kernel. The following code was tried with CUDA 10.1 on the Turing architecture.
__global__ void async_scatter_kernel(float* first,
float* last,
int* map,
float* output)
{
cudaStream_t stream;
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
thrust::scatter(thrust::cuda::par.on(stream), first, last, map, output);
cudaDeviceSynchronize();
cudaStreamDestroy(stream);
}
I know thrust uses dynamic parallelism to launch its kernels when called from the device, however I couldn't find a way to specify the stream.

The following code compiles cleanly for me on CUDA 10.1.243:
$ cat t1518.cu
#include <thrust/scatter.h>
#include <thrust/execution_policy.h>
__global__ void async_scatter_kernel(float* first,
float* last,
int* map,
float* output)
{
cudaStream_t stream;
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
thrust::scatter(thrust::cuda::par.on(stream), first, last, map, output);
cudaDeviceSynchronize();
cudaStreamDestroy(stream);
}
int main(){
float *first = NULL;
float *last = NULL;
float *output = NULL;
int *map = NULL;
async_scatter_kernel<<<1,1>>>(first, last, map, output);
cudaDeviceSynchronize();
}
$ nvcc -arch=sm_35 -rdc=true t1518.cu -o t1518
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
$
The -arch=sm_35 (or similar) and -rdc=true are necessary (but not in all cases sufficient) compile switches for any code that uses CUDA Dynamic Parallelism. If you omit, for example, the -rdc=true switch, you get an error similar to what you describe:
$ nvcc -arch=sm_35 t1518.cu -o t1518
t1518.cu(11): error: calling a __host__ function("thrust::cuda_cub::par_t::on const") from a __global__ function("async_scatter_kernel") is not allowed
t1518.cu(11): error: identifier "thrust::cuda_cub::par_t::on const" is undefined in device code
2 errors detected in the compilation of "/tmp/tmpxft_00003a80_00000000-8_t1518.cpp1.ii".
$
So, for the example you have shown here, your compilation error can be eliminated either by updating to the latest CUDA version or by specifying the proper command line, or both.

cuda & rdc & thrust in multiple shared objects results in SIGSEGV in registerEntryFunction

I'm trying to run relocatable-device-code in two shared libraries, both using cuda-thrust. Everything runs fine if I stop using thrust in kernel.cu, which is not an option.
edit: The program works too if rdc is disabled. Not an option for me either.
It compiles fine but stops with a segfault when run. gdb tells me this:
Program received signal SIGSEGV, Segmentation fault.
0x0000000000422cc8 in cudart::globalState::registerEntryFunction(void**, char const*, char*, char const*, int, uint3*, uint3*, dim3*, dim3*, int*) ()
(cuda-gdb) bt
#0 0x0000000000422cc8 in cudart::globalState::registerEntryFunction(void**, char const*, char*, char const*, int, uint3*, uint3*, dim3*, dim3*, int*) ()
#1 0x000000000040876c in __cudaRegisterFunction ()
#2 0x0000000000402b58 in __nv_cudaEntityRegisterCallback(void**) ()
#3 0x00007ffff75051a3 in __cudaRegisterLinkedBinary(__fatBinC_Wrapper_t const*, void (*)(void**), void*) ()
from /home/mindoms/rdctestmcsimple/libkernel.so
#4 0x00007ffff75050b1 in __cudaRegisterLinkedBinary_66_tmpxft_00007a5f_00000000_16_cuda_device_runtime_compute_52_cpp1_ii_8b1a5d37 () from /home/user/rdctestmcsimple/libkernel.so
#5 0x000000000045285d in __libc_csu_init ()
#6 0x00007ffff65ea50f in __libc_start_main () from /lib64/libc.so.6
Here is my stripped down example (using cmake) that shows the error.
main.cpp:
#include "kernel.cuh"
#include "kernel2.cuh"
int main(){
Kernel k;
k.callKernel();
Kernel2 k2;
k2.callKernel2();
}
kernel.cuh:
#ifndef __KERNEL_CUH__
#define __KERNEL_CUH__
class Kernel{
public:
void callKernel();
};
#endif
kernel.cu:
#include "kernel.cuh"
#include <stdio.h>
#include <iostream>
#include <thrust/device_vector.h>
__global__
void thekernel(int *data){
if (threadIdx.x == 0)
printf("the kernel says hello\n");
data[threadIdx.x] = threadIdx.x * 2;
}
void Kernel::callKernel(){
thrust::device_vector<int> D2;
D2.resize(11);
int * raw_ptr = thrust::raw_pointer_cast(&D2[0]);
printf("Kernel::callKernel called\n");
thekernel <<< 1, 10 >>> (raw_ptr);
cudaThreadSynchronize();
cudaError_t code = cudaGetLastError();
if (code != cudaSuccess) {
std::cout << "Cuda error: " << cudaGetErrorString(code) << " after callKernel!" << std::endl;
}
for (int i = 0; i < D2.size(); i++)
std::cout << "Kernel D[" << i << "]=" << D2[i] << std::endl;
}
kernel2.cuh:
#ifndef __KERNEL2_CUH__
#define __KERNEL2_CUH__
class Kernel2{
public:
void callKernel2();
};
#endif
kernel2.cu
#include "kernel2.cuh"
#include <stdio.h>
#include <iostream>
#include <thrust/device_vector.h>
__global__
void thekernel2(int *data2){
if (threadIdx.x == 0)
printf("the kernel2 says hello\n");
data2[threadIdx.x] = threadIdx.x * 2;
}
void Kernel2::callKernel2(){
thrust::device_vector<int> D;
D.resize(11);
int * raw_ptr = thrust::raw_pointer_cast(&D[0]);
printf("Kernel2::callKernel2 called\n");
thekernel2 <<< 1, 10 >>> (raw_ptr);
cudaThreadSynchronize();
cudaError_t code = cudaGetLastError();
if (code != cudaSuccess) {
std::cout << "Cuda error: " << cudaGetErrorString(code) << " after callKernel2!" << std::endl;
}
for (int i = 0; i < D.size(); i++)
std::cout << "Kernel2 D[" << i << "]=" << D[i] << std::endl;
}
The cmake file below was used originally, but I get the same problem when I compile "by hand":
nvcc -arch=sm_35 -Xcompiler -fPIC -dc kernel2.cu
nvcc -arch=sm_35 -shared -Xcompiler -fPIC kernel2.o -o libkernel2.so
nvcc -arch=sm_35 -Xcompiler -fPIC -dc kernel.cu
nvcc -arch=sm_35 -shared -Xcompiler -fPIC kernel.o -o libkernel.so
g++ -o main main.cpp libkernel.so libkernel2.so -L/opt/cuda/current/lib64
Adding -cudart shared to every nvcc call as suggested somewhere results in a different error:
warning: Cuda API error detected: cudaFuncGetAttributes returned (0x8)
terminate called after throwing an instance of 'thrust::system::system_error'
what(): function_attributes(): after cudaFuncGetAttributes: invalid device function
Program received signal SIGABRT, Aborted.
0x000000313c432625 in raise () from /lib64/libc.so.6
(cuda-gdb) bt
#0 0x000000313c432625 in raise () from /lib64/libc.so.6
#1 0x000000313c433e05 in abort () from /lib64/libc.so.6
#2 0x00000031430bea7d in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#3 0x00000031430bcbd6 in std::set_unexpected(void (*)()) () from /usr/lib64/libstdc++.so.6
#4 0x00000031430bcc03 in std::terminate() () from /usr/lib64/libstdc++.so.6
#5 0x00000031430bcc86 in __cxa_rethrow () from /usr/lib64/libstdc++.so.6
#6 0x00007ffff7d600eb in thrust::detail::vector_base<int, thrust::device_malloc_allocator<int> >::append(unsigned long) () from ./libkernel.so
#7 0x00007ffff7d5f740 in thrust::detail::vector_base<int, thrust::device_malloc_allocator<int> >::resize(unsigned long) () from ./libkernel.so
#8 0x00007ffff7d5b19a in Kernel::callKernel() () from ./libkernel.so
#9 0x00000000004006f8 in main ()
CMakeLists.txt (please adjust to your environment):
cmake_minimum_required(VERSION 2.6.2)
project(Cuda-project)
set(CMAKE_MODULE_PATH "${CMAKE_SOURCE_DIR}/CMake/cuda" ${CMAKE_MODULE_PATH})
SET(CUDA_TOOLKIT_ROOT_DIR "/opt/cuda/current")
SET(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} -gencode arch=compute_52,code=sm_52)
find_package(CUDA REQUIRED)
link_directories(${CUDA_TOOLKIT_ROOT_DIR}/lib64)
set(CUDA_SEPARABLE_COMPILATION ON)
set(BUILD_SHARED_LIBS ON)
list(APPEND CUDA_NVCC_FLAGS -Xcompiler -fPIC)
CUDA_ADD_LIBRARY(kernel
kernel.cu
)
CUDA_ADD_LIBRARY(kernel2
kernel2.cu
)
cuda_add_executable(rdctest main.cpp)
TARGET_LINK_LIBRARIES(rdctest kernel kernel2 cudadevrt)
About my system:
Fedora 23
kernel: 4.4.2-301.fc23.x86_64
Nvidia Driver: 361.28
Nvidia Toolkit: 7.5.18
g++: g++ (GCC) 5.3.1 20151207 (Red Hat 5.3.1-2)
Reproduced on:
CentOS release 6.7 (Final)
Kernel: 2.6.32-573.8.1.el6.x86_64
Nvidia Driver: 352.55
Nvidia Toolkit: 7.5.18
g++ (GCC) 4.4.7 20120313 (Red Hat 4.4.7-16)
glibc 2.12
cmake to 3.5

Apparently, this has something to do with what cuda runtime is used: shared or static.
I slightly modified your example: Instead of building two shared libraries and linking them to the executable individually, I create two static libraries that are linked together to one shared library, and that one is linked to the executable.
Also, here is an updated CMake file that uses the new (>= 3.8) native CUDA language support.
cmake_minimum_required(VERSION 3.8)
project (CudaSharedThrust CXX CUDA)
string(APPEND CMAKE_CUDA_FLAGS " -gencode arch=compute_61,code=compute_61")
if(BUILD_SHARED_LIBS)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
endif()
add_library(kernel STATIC kernel.cu)
set_target_properties(kernel PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
add_library(kernel2 STATIC kernel2.cu)
set_target_properties(kernel2 PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
add_library(allkernels empty.cu) # empty.cu is an empty file
set_target_properties(allkernels PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
target_link_libraries(allkernels kernel kernel2)
add_executable(rdctest main.cpp)
set_target_properties(rdctest PROPERTIES CUDA_SEPARABLE_COMPILATION ON)
target_link_libraries(rdctest allkernels)
Building this without any CMake flags (static build), the build succeeds and the program works.
Building with -DBUILD_SHARED_LIBS=ON, the program compiles, but it crashes with the same error as yours.
Building with
cmake .. -DBUILD_SHARED_LIBS=ON -DCMAKE_CUDA_FLAGS:STRING="--cudart shared"
compiles, and actually makes it run! So for some reason, the shared CUDA runtime is required for this sort of thing.
Also note that the step from two shared objects to two static libraries inside one shared object was necessary, because otherwise the program would crash with a thrust::system::system_error.
This, however, is expected, because NVCC ignores shared object files during device linking: http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#libraries

How to compile multiple files in cuda?

I did it the way I do with gcc
nvcc a.cu ut.cu
but the compiler shows
ptxas fatal : Unresolved extern function '_Z1fi'
The problem only occurs when the function is __device__ function.
[File ut.h]
__device__ int f(int);
[File ut.c]
#include "ut.h"
__device__ int f(int a){
return a*a;
}
[File a.cu]
#include "ut.h"
__global__ void mk(){
f(5);
}
int main(){
mk<<<1,1>>>();
}

When a __device__ or __global__ function calls a __device__ function (or __global__ function, in the case of dynamic parallelism) in another translation unit (i.e. file), then it is necessary to use device linking. To enable device linking with your simple compile command, just add the -rdc=true switch:
nvcc -rdc=true a.cu ut.cu
That should fix the issue.
Note that your compile command lists "ut.cu", but your question shows "ut.c". I assume the file is actually "ut.cu"; if not, you will also need to rename "ut.c" to "ut.cu".
You can read more about device linking in the nvcc manual.

Dynamic Parallelism - undefined reference to __cudaRegisterLinkedBinary linking error while compiling - separate compilation

I have a problem compiling a simple program in which the C++ and CUDA code are compiled separately.
Here's my code
main.cpp:
#include "file.cuh"
int main( void )
{
test();
return 0;
}
file.cuh:
void test( void );
file.cu:
#include <cuda.h>
#include <cuda_runtime.h>
#include <cstdio>
#include "file.cuh"
__global__ void printId( void )
{
    printf("Hello from block %d \n", blockIdx.x);
}
__global__ void DynPara( void )
{
    dim3 grid( 2, 1, 1 );
    dim3 block( 1, 1, 1 );
    printId<<< grid, block >>>();
}
void test( void )
{
    dim3 grid( 1, 1, 1 );
    dim3 block( 1, 1, 1 );
    DynPara<<< grid, block >>>();  // name must match the kernel defined above
}
I compile with:
nvcc -arch=sm_35 -lcudadevrt -rdc=true -c file.cu
g++ file.o main.cpp -L<path> -lcudart
And here's the error at link time:
file.o: In function `__sti____cudaRegisterAll_39_tmpxft_00005b2f_00000000_6_file_cpp1_ii_99181f96()':
tmpxft_00005b2f_00000000-3_file.cudafe1.cpp:(.text+0x1cd): undefined reference to `__cudaRegisterLinkedBinary_39_tmpxft_00005b2f_00000000_6_file_cpp1_ii_99181f96'
os: Red Hat
card: K20x
Any idea?
Thanks

This question is pretty much a duplicate of this recent question.
Dynamic parallelism requires relocatable device code linking, in addition to compiling.
Your nvcc command line specifies a compile-only operation (-rdc=true -c).
g++ does not do any device code linking. So in a scenario like this, when doing the final link operation using g++ an extra device code link step is required.
Something like this:
nvcc -arch=sm_35 -rdc=true -c file.cu
nvcc -arch=sm_35 -dlink -o file_link.o file.o -lcudadevrt -lcudart
g++ file.o file_link.o main.cpp -L<path> -lcudart -lcudadevrt

When using CMake, setting CUDA_SEPARABLE_COMPILATION before find_package() enables both relocatable device code compiling and linking:
SET(CUDA_SEPARABLE_COMPILATION ON)
find_package(CUDA QUIET REQUIRED)

Sorry, my reputation is too low to comment directly under Robert Crovella's answer:
https://stackoverflow.com/a/22116121/14377278
His commands worked for me, but I also had to give nvcc the CUDA library path (-L<path>) in the compile and device-link steps, as below:
nvcc -arch=sm_35 -rdc=true -c file.cu -L<path>
nvcc -arch=sm_35 -dlink -o file_link.o file.o -lcudadevrt -lcudart -L<path>
g++ file.o file_link.o main.cpp -L<path> -lcudart -lcudadevrt

Unable to decipher nvlink error

I'm attempting to build a project with nvcc. I am getting the most vexing "nvlink error" messages I've ever seen.
Here is the link statement:
nvcc -rdc=true -arch=sm_21 -O3 -Xcompiler -fPIC -I"/usr/local/ACE_wrappers" -I"/usr/local/ACE_wrappers/TAO" -I"/usr/local/DDS" -I"/usr/include/Qt" -I"/usr/include/QtCore" -I"/usr/include/QtGui" -I"../../include" -I"../../include/DDS" -I"../../include/CoordinateTransforms" -I"../../include/DDS/IDLBrokerTemplates" -I"../../def/IDL" -I"../../def/CMD" -I"../../def/XSD" -I"../../src/NetAcquire" -I"/usr/local/ACE_wrappers/TAO/orbsvcs" -I"/usr/local/include/lct.7.5.4" -L"." -L"/usr/local/ACE_wrappers/lib" -L"/usr/local/DDS/lib" -L"/usr/lib64" -L"/usr/local/lib64" -L"../../def/IDL/lib" -L"../../def/XSD" -L"/usr/local/lib" .obj/../../src/Component.o .obj/../../src/COM.o .obj/../../src/DDS/EntityManager.o .obj/../../src/IDLBrokerTemplates/CommandManager.o .obj/../../src/IDLBrokerTemplates/OptionManager.o .obj/../../include/ApplicationProcessStateReporter_moc.o .obj/../../src/Application.o .obj/../../src/CoordinateTransforms/Site.o .obj/../../src/CoordinateTransforms/Geodesy.o .obj/../../src/CoordinateTransforms/Earth.o .obj/../../src/CoordinateTransforms/StateVector.o .obj/../../src/CoordinateTransforms/KeplerianImpact.o .obj/../../src/CoordinateTransforms/GeodeticPosition.o .obj/../../src/IDLBrokerTemplates/MeasurandSubscription.o .obj/../../src/NetAcquire/NetAcquire.o .obj/DataLossFlightTimeImpl.o .obj/DataLossFlightTime.o .obj/DftTable.o .obj/OptionListener.o .obj/PrimaryListener.o .obj/MissionTimeListener.o .obj/DeadMan.o .obj/main.o .obj/../../src/XML/spline.o .obj/../../src/XML/FpTable.o -l"naps-x86_64" -l"naio-x86_64" -l"nalct-x86_64" -l"curl" -l"TAO_Messaging" -l"TAO_Valuetype" -l"TAO_PI_Server" -l"TAO_PI" -l"TAO_CodecFactory" -l"TAO_CosNaming" -l"armadillo" -l"boost_filesystem" -l"boost_system" -l"xerces-c" -l"jarssXSD" -l"OpenDDS_Tcp" -l"JARSSRTv10" -l"QtNetwork" -l"fontconfig" -l"QtGui" -l"QtCore" -l"OpenDDS_Rtps_Udp" -l"OpenDDS_Rtps" -l"OpenDDS_Multicast" -l"OpenDDS_Udp" -l"OpenDDS_InfoRepoDiscovery" -l"OpenDDS_Dcps" -l"TAO_PortableServer" -l"TAO_AnyTypeCode" 
-l"TAO" -l"ACE" -o "DFT"
And I'm getting
nvlink error : Undefined reference to '_ZN5JARSS15KeplerianImpactC1ERKdS2_S2_S2_S2_S2_'
nvlink error : Undefined reference to '_ZNK5JARSS15KeplerianImpact9getStatusEv'
nvlink error : Undefined reference to '_ZNK5JARSS15KeplerianImpact13getImpactTimeEv'
nvlink error : Undefined reference to '_ZNK5JARSS15KeplerianImpact11getPlhStateEv'
nvlink error : Undefined reference to '_ZN5JARSS15KeplerianImpactD1Ev'
nvlink error : Undefined reference to '_ZN5JARSS7Geodesy12EFG2GeodeticERKdS2_S2_PdS3_S3_'
I'm certain that these functions/files are included in the compile. You can see from the compile that KeplerianImpact.cpp and Geodesy.cpp are in there.
Is there any way to make the link output easier to read so I can debug this?

Use c++filt to demangle the names. For instance:
$ c++filt _ZN5JARSS15KeplerianImpactC1ERKdS2_S2_S2_S2_S2_
JARSS::KeplerianImpact::KeplerianImpact(double const&, double const&, double const&, double const&, double const&, double const&)
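If c++filt is not at hand, or the names need to be decoded inside a tool of your own, the same demangling is available programmatically. The following is only a minimal sketch, assuming a GCC- or Clang-style toolchain that ships <cxxabi.h>; the helper name demangle is hypothetical and not part of any CUDA tool.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Hypothetical helper: returns the human-readable form of an Itanium-ABI
// mangled symbol, or the input unchanged if it cannot be demangled.
std::string demangle(const char* mangled) {
    assert(mangled != nullptr);
    int status = 0;
    // With null buffer arguments, __cxa_demangle allocates the result with
    // malloc, so it must be released with free.
    char* raw = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || raw == nullptr) {
        return std::string(mangled);
    }
    std::string result(raw);
    std::free(raw);
    return result;
}
```

Calling demangle on each symbol that nvlink reports gives the class and member function to look for, which makes it easier to see which definitions are missing from the device link.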

I faced this problem earlier. My guess is that you haven't linked the device objects with the device linker.
Generate relocatable device code by compiling as shown below (-dc is the device equivalent of -c; see the manual for more information):
nvcc -arch=sm_21 -dc a.cu b.cu
Link the device parts of the code by calling nvlink, or nvcc with -dlink, before the final host link:
nvlink -arch=sm_21 a.o b.o -o link.o (or)
nvcc -arch=sm_21 -dlink a.o b.o -o link.o
Finally, build the executable with the host compiler:
g++ a.o b.o link.o -L<path> -lcudart

I figured this out.
I needed to define my functions in the correct files. For example, in Foo.h:
class Foo {
public:
    __host__ __device__
    Foo();
};
and the function definition in Foo.cu not Foo.cpp as I originally thought.
Foo::Foo() {}
For the constant variables, I needed to implement a slightly different strategy.
Here is an example of the C++ class that I started with:
class Foo {
public:
    static double const epsilon;
    static void functionThatUsesEpsilon();
    /**/
};
It had to be converted so that the device copy of epsilon is defined at namespace scope:
namespace foo {
    extern __constant__ double epsilon;
}
class Foo {
public:
    // same stuff as before with the addition of this function
    __host__ __device__
    static inline double getEpsilon() {
#ifdef __CUDA_ARCH__  // defined only while nvcc compiles device code
        return foo::epsilon;
#else
        return epsilon;
#endif
    }
    static void functionThatUsesEpsilon() {
        if (bar < getEpsilon()) { /* etc */ }
    }
};
The #ifdef above returns the correct version of the variable for either the host or the device code. Everywhere I had referenced Foo::epsilon, I needed to substitute Foo::getEpsilon() so that the correct epsilon was returned.
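A related detail is that a header like this is often included from plain .cpp files as well, where the CUDA qualifiers may not be defined. The following is a hedged sketch of one common way to handle that, not the code from the project above: the HOST_DEVICE macro, the belowEpsilon helper and the value 1e-9 are all hypothetical. It selects the branch with __CUDA_ARCH__, which nvcc defines only during the device compilation pass, so host code always reads the host copy.

```cpp
#include <cassert>

// Under nvcc, __CUDACC__ is defined and the qualifiers exist. Under a
// host-only compiler they are defined away (hypothetical macro name).
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

#ifdef __CUDACC__
namespace foo {
extern __constant__ double epsilon;  // device copy, defined in one .cu file
}
#endif

class Foo {
public:
    static const double epsilon;  // host copy

    HOST_DEVICE static double getEpsilon() {
#ifdef __CUDA_ARCH__
        return foo::epsilon;  // device compilation pass
#else
        return epsilon;       // host compilation pass
#endif
    }

    // Hypothetical host-side helper that uses the accessor.
    static bool belowEpsilon(double bar) {
        // A non-positive tolerance would reject every non-negative value.
        assert(getEpsilon() > 0.0);
        return bar < getEpsilon();
    }
};

const double Foo::epsilon = 1e-9;  // hypothetical value for illustration
```

Compiled with g++ alone, only the host branch is seen, so the header can be shared between .cpp and .cu files without changes.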
Hope this helps someone in the future. Thanks to @RobertCrovella for getting me thinking.
