Compiling my CUDA program with libraries provided in the toolkit

I wrote a simple CUDA C++ program simulating diffusion on a 2D matrix. I ran into trouble when I tried to use some of the libraries provided in the Toolkit. I would like to replace my extremely inefficient matrix transpose kernel with something from cuBLAS, and to replace implCU with cuSolver's implementation of solving linear systems. The trouble is that I don't know how to use the functions or how to compile the code. The sample codes provided by NVIDIA build fine with their Makefiles. If someone could help me, ideally by showing how these functions are supposed to be used when writing .cu files, I would be grateful.
Here is the code: http://pastebin.com/UKhJZQBz
I am on Ubuntu 16.04 and I have exported the PATH variables (so they include /usr/local/cuda-8.0/bin) as described in the official guide.
Here is the output from nvcc -I /usr/local/cuda-8.0/samples/common/inc/ difusion2d.cu
/tmp/tmpxft_00001c09_00000000-16_difusion2d.o: In function `csr_mat_norminf(int, int, int, cusparseMatDescr*, double const*, int const*, int const*)':
undefined reference to `cusparseGetMatIndexBase'
/tmp/tmpxft_00001c09_00000000-16_difusion2d.o: In function `display_matrix(int, int, int, cusparseMatDescr*, double const*, int const*, int const*)':
undefined reference to `cusparseGetMatIndexBase'
/tmp/tmpxft_00001c09_00000000-16_difusion2d.o: In function `main':
undefined reference to `cusolverDnCreate'
undefined reference to `cublasCreate_v2'
undefined reference to `cusolverDnSetStream'
undefined reference to `cublasSetStream_v2'
collect2: error: ld returned 1 exit status

You must explicitly link the cublas and cusolver libraries (and, given the cusparseGetMatIndexBase references in your error output, cusparse as well). Something like
nvcc -I /usr/local/cuda-8.0/samples/common/inc \
    -L/path/to/CUDA/libraries difusion2d.cu -lcublas -lcusolver -lcusparse
should work. Depending on your installation, the -L option providing a search path to the libraries may or may not be necessary.
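For the usage side of your question, here is a minimal sketch (not taken from your code; the function name and shapes are illustrative) of replacing a hand-written transpose with cuBLAS. cublasDgeam computes C = alpha*op(A) + beta*op(B), so with op(A) = A^T and beta = 0 it performs an out-of-place transpose:
#include <cublas_v2.h>

// Transpose the m x n column-major matrix d_A into the n x m matrix d_At.
// d_A and d_At must not overlap; error checking omitted for brevity.
void transpose(cublasHandle_t handle, int m, int n,
               const double *d_A, double *d_At)
{
    const double alpha = 1.0;
    const double beta  = 0.0;  // B is not read when beta == 0
    cublasDgeam(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                n, m,            // dimensions of the result C = A^T
                &alpha, d_A, m,  // op(A) = A^T of the m x n input
                &beta,  d_At, n, // B (unused because beta == 0)
                d_At, n);        // C, the n x m transpose
}
Create the handle once with cublasCreate(&handle) and release it with cublasDestroy(handle); the cuSolver side is analogous (create a cusolverDnHandle_t with cusolverDnCreate and pass it to the dense solver routines).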

Related

Undefined Symbol Error when using thrust::max_element

I am working on a CUDA C++ project that uses separable compilation, and I am having some trouble getting a thrust function to compile.
The project builds with no problem until the following function call is added.
thrust::device_ptr<float> max_int = thrust::max_element(
    thrust::device_ptr<float>(dev_temp_intensity_buffer),
    thrust::device_ptr<float>(dev_temp_intensity_buffer + INT_BUF_SIZE));
As mentioned, I get this build error:
Severity Code Description Project File Line Suppression State
Error LNK2019 unresolved external symbol __fatbinwrap_66_tmpxft_00006db0_00000000_18_cuda_device_runtime_compute_61_cpp1_ii_8b1a5d37 referenced in function __cudaRegisterLinkedBinary_66_tmpxft_00006db0_00000000_18_cuda_device_runtime_compute_61_cpp1_ii_8b1a5d37 visualize C:\Users\13\Google Drive\WireMeshOT Rafael\CUDA\simulator\build\src\visualize_intermediate_link.obj 1
The funny thing is that this other thrust function call compiles just fine:
thrust::exclusive_scan(thrust::device_ptr<unsigned int>(dev_ray_alive),
thrust::device_ptr<unsigned int>(dev_ray_alive + NRAYS),
thrust::device_ptr<unsigned int>(dev_scanned_alive_rays));
Obs1: dev_temp_intensity_buffer is a float device pointer, and I am including thrust/extrema.h and thrust/device_ptr.h.
Obs2: I am using CMake to configure the build. The relevant CMake code excerpts are shown below.
SET(CUDA_SEPARABLE_COMPILATION ON)
set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} -rdc=true -D_FORCE_INLINES)
set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} -arch=compute_52 -code=sm_52 -lcudart -lcudadevrt -lcuda)
set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} -Xptxas -v)
cuda_add_executable(
project
file1.cu
...)
target_link_libraries (project glut glew)
I finally figured it out!
The linking problem was due to the fact that the cudadevrt library was missing. The catch is that only adding -lcudadevrt to CUDA_NVCC_FLAGS was not enough!
The problem goes away when linking the CUDA runtime device library to the CMake target as shown below:
target_link_libraries(project glut glew ${CUDA_cudadevrt_LIBRARY})
Obs1: the CUDA_cudadevrt_LIBRARY variable is only made available in CMake versions 3.7.2 and above. Adding the line cmake_minimum_required(VERSION 3.7.2) is a good idea.
Obs2: linking only to CUDA_LIBRARIES, as below, solves the issue only if you are using CMake 3.7.2 or above. In lower versions this variable exists but does not contain the cudadevrt library.
target_link_libraries(project glut glew ${CUDA_LIBRARIES})

Accessing GSL Library with Cython

I am trying to use the GSL library in a Cython program but don't seem to have the paths correctly specified; I encounter the following error when I try writing a simple example:
%load_ext cythonmagic

%%cython -lgsl -lgslcblas
cdef extern from "gsl/gsl_ran_poisson_pdf.h":
    double gsl_ran_poisson_pdf(int x, double mu)

def poison(int x, double mu):
    return gsl_ran_poisson_pdf(x, mu)
/Users/name/.ipython/cython/_cython_magic_189673701925d12059c18b75663da8bd.c:317:10: fatal error:
'gsl/gsl_mode.h' file not found
#include "gsl/gsl_mode.h"
I get the same error using CythonGSL and the demo program here: http://nbviewer.ipython.org/github/twiecki/CythonGSL/blob/master/examples/cython_gsl_ipythonnb.ipynb
The GSL libraries are located in the following directories:
-I/usr/local/include
-L/usr/local/lib -lgsl
I know that similar questions have been asked on SO before, but I couldn't find one relevant to my situation and system (I'm using OS-X). Any help would be appreciated.
Thanks!

Online compilation of single CUDA function

I have a function in my program called float valueAt(float3 v). It's supposed to return the value of a function at the given point. The function is user-specified. I have an interpreter for this function at the moment, but others recommended I compile the function online so it's in machine code and is faster.
How do I do this? I believe I know how to load the function when I have PTX generated, but I have no idea how to generate the PTX.
CUDA provides no way of runtime compilation of non-PTX code.
What you want can be done, but not using the standard CUDA APIs. PyCUDA provides an elegant just-in-time compilation method for CUDA C code which includes behind the scenes forking of the toolchain to compile to device code and loading using the runtime API. The (possible) downside is that you need to use Python for the top level of your application, and if you are shipping code to third parties, you might need to ship a working Python distribution too.
The only other alternative I can think of is OpenCL, which does support runtime compilation (that is all it supported until recently). The C99 language base is a lot more restrictive than what CUDA offers, and I find the APIs to be very verbose, but the runtime compilation model works well.
I've thought about this problem for a while, and while I don't think this is a "great" solution, it does seem to work so I thought I would share it.
The basic idea is to use Linux to spawn processes to compile and then run the compiled code. I think this is pretty much a no-brainer, but since I put together the pieces, I'll post instructions here in case it's useful for somebody else.
The problem statement in the question is to be able to take a file that contains a user-defined function, let's assume it is a function of a single variable f(x), i.e. y = f(x), and that x and y can be represented by float quantities.
The user would edit a file called fx.txt that contains the desired function. This file must conform to C syntax rules.
fx.txt:
y=1/x
This file then gets included in the __device__ function that will be holding it:
user_testfunc.cuh:
__device__ float fx(float x){
    float y;
#include "fx.txt"
    ;
    return y;
}
which gets included in the kernel that is called via a wrapper.
cudalib.cu:
#include <math.h>
#include "cudalib.h"
#include "user_testfunc.cuh"
__global__ void my_kernel(float x, float *y){
    *y = fx(x);
}
float cudalib_compute_fx(float x){
    float *d, *h_d;
    h_d = (float *)malloc(sizeof(float));  // host copy of the result
    cudaMalloc(&d, sizeof(float));         // device result
    my_kernel<<<1,1>>>(x, d);
    cudaMemcpy(h_d, d, sizeof(float), cudaMemcpyDeviceToHost);
    float result = *h_d;
    cudaFree(d);   // release resources (leaked in the original version)
    free(h_d);
    return result;
}
cudalib.h:
float cudalib_compute_fx(float x);
The above files get built into a shared library:
nvcc -arch=sm_20 -Xcompiler -fPIC -shared cudalib.cu -o libmycudalib.so
We need a main application to use this shared library.
t452.cu:
#include <stdio.h>
#include <stdlib.h>
#include "cudalib.h"
int main(int argc, char* argv[]){
    if (argc == 1){
        // recompile the library, then spawn a new process that uses it
        int retval = system("nvcc -arch=sm_20 -Xcompiler -fPIC -shared cudalib.cu -o libmycudalib.so");
        char scmd[128];
        sprintf(scmd, "%s skip", argv[0]);
        retval = system(scmd);
    }
    else { // compute f(x) at x = 2.0
        printf("Result is: %f\n", cudalib_compute_fx(2.0));
    }
    return 0;
}
Which is compiled like this:
nvcc -arch=sm_20 -o t452 t452.cu -L. -lmycudalib
At this point, the main application (t452) can be executed and it will produce the result of f(2.0) which is 0.5 in this case:
$ LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./t452
Result is: 0.500000
The user can then modify the fx.txt file:
$ vi fx.txt
$ cat fx.txt
y = 5/x
And just re-run the app, and the new functional behavior is used:
$ LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./t452
Result is: 2.500000
This method takes advantage of the fact that upon recompilation/replacement of a shared library, a new Linux process will pick up the new shared library. Also note that I've omitted several kinds of error checking for clarity. At a minimum I would check CUDA errors, and I would also probably delete the shared object (.so) library before recompiling it, and then test for its existence after compilation, as a basic check that the compilation proceeded successfully.
This method entirely uses the runtime API to achieve this goal, so as a result the user would have to have the CUDA toolkit installed on their machine and appropriately set up so that nvcc is available in the PATH. Using the driver API with PTX code would make this process much cleaner (and not require the toolkit on the user's machine), but AFAIK there is no way to generate PTX from CUDA C without using nvcc or a user-created toolchain built on the nvidia llvm compiler tools. In the future, there may be a more "integrated" approach available in the "standard" CUDA C toolchain, or perhaps even by the driver.
A similar approach can be arranged using separate compilation and linking of device code, such that the only source code that needs to be exposed to the user is in user_testfunc.cu (and fx.txt).
EDIT: There is now a CUDA runtime compilation facility (NVRTC, introduced in CUDA 7.0), which should be used in place of the above.
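As a rough illustration of that facility, here is a minimal sketch (my addition, not part of the original answer; the kernel name and source string are illustrative) that compiles a CUDA C string to PTX with NVRTC and launches it through the driver API. Link with -lnvrtc -lcuda; error checking omitted:
#include <nvrtc.h>
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

const char *src =
    "extern \"C\" __global__ void fx_kernel(float x, float *y){ *y = 1.0f/x; }";

int main(){
    // Compile the source string to PTX at runtime.
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "fx.cu", 0, NULL, NULL);
    nvrtcCompileProgram(prog, 0, NULL);
    size_t ptxSize;
    nvrtcGetPTXSize(prog, &ptxSize);
    char *ptx = (char *)malloc(ptxSize);
    nvrtcGetPTX(prog, ptx);
    nvrtcDestroyProgram(&prog);

    // Load the PTX and launch the kernel via the driver API.
    cuInit(0);
    CUdevice dev;   cuDeviceGet(&dev, 0);
    CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);
    CUmodule mod;   cuModuleLoadData(&mod, ptx);
    CUfunction fn;  cuModuleGetFunction(&fn, mod, "fx_kernel");

    float x = 2.0f, h_y;
    CUdeviceptr d_y;
    cuMemAlloc(&d_y, sizeof(float));
    void *args[] = { &x, &d_y };
    cuLaunchKernel(fn, 1,1,1, 1,1,1, 0, NULL, args, NULL);
    cuMemcpyDtoH(&h_y, d_y, sizeof(float));
    printf("Result is: %f\n", h_y);
    free(ptx);
    return 0;
}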

Undefined symbol (linking .so C and Cython Code)

Since I made some progress, I changed the title and made a second edit describing my new problem. You may choose to ignore Edit1
I have been trying to run Python code from C code, and for this purpose I have been using Cython.
The semantics of my system are such that there is a binary (whose source I cannot access) that calls a C function defined in a file (whose source is accessible), and within this function I need to call Python functions, do some processing, and return the result to the binary.
To achieve this, these are the approaches that I came across:
http://docs.python.org/release/2.5.2/ext/callingPython.html ===> This approach suggests passing the Python callback function to the C side so that the callback can be invoked as necessary, but this doesn't work for me because I don't have access to the binary's source (which runs the entire system).
https://stackoverflow.com/a/5721123/1126425 ==> I have tried this approach, and I get this error when the Cython function is called:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb47deb70 (LWP 2065)]
0x007fd38a in PySys_GetObject () from /usr/lib/libpython2.6.so.1.0
http://www.linuxjournal.com/article/8497?page=0,0 ==> This is in fact the basis for Cython's functionality, but again, when I use the examples described there, I get errors similar to 2.
I have no idea how to resolve these errors. Any help would be much appreciated.
Thanks!!
Edit1:
Here is a simple scenario that reflects the situation:
external.c
#include <external.h>

int callback(int param1, int param2) // Function that the binary calls
{
    /* SomeTasks */
    cython_function(); // Function defined in the following .pyx file
    /* SomeTasks */
}
cython_file.pyx
cdef void cython_function():
    print "Do Nothing!"
I am linking the shared library file created by cython with the library generated by compiling the above C code and then that library is used by the binary...
Edit2:
The segmentation fault goes away when I add Py_Initialize(); before calling cython_function(). But now I am getting an undefined symbol error: symbol lookup error: lib_c_code.so: undefined symbol: cython_function
Here lib_c_code.so is the shared library created from the external.c file above. I have tried including the .h file created by the Cython compiler in external.c, but it still didn't work out. Here is how I am compiling lib_c_code.so:
gcc -shared -dynlib -lm -W1 -o lib_c_code.so $(OBJDIR)/*.o -lc -lm -lpy_code
and libpy_code.so is the shared object file created from cython_file.pyx as:
cython cython_file.pyx -o cython_file.c
gcc $(IFLAGS) -I/usr/include/python2.6 -fPIC -shared cython_file.c -lpython2.6 -lm -o libpy_code.so
Also, I can see the symbol cython_function in the lib_c_code.so file when I do nm -g lib_c_code.so.
Any ideas, please?
I have to guess here that there's a callback registration function to which you can pass the function pointer, in which case you can simply forgo the C file, define a cdef function directly in your Cython code, and pass that to the callback registration function. Use with gil in case you manipulate any Python objects in it.
cdef extern from "external.h":
    ctypedef int (*Cb_Func)(int param1, int param2)
    void register_callback(Cb_Func func)

cdef int my_callback(int param1, int param2) with gil:
    <implementation>

register_callback(my_callback)
This is also explained in the Cython user manual here: http://docs.cython.org/src/userguide/external_C_code.html
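For completeness, a sketch of what the external.h assumed above would declare (this header is hypothetical; the answer infers it from the question):
/* external.h -- hypothetical callback registration interface */
typedef int (*Cb_Func)(int param1, int param2);
void register_callback(Cb_Func func);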

Can CUDA + Thrust projects be split into more than one file?

I'm building a project which uses both Thrust (CUDA API) and OpenMP. The main purpose of my program is to present an interface for calculating something in parallel.
In order to do that, I've decided to use the Strategy design pattern, which basically means defining a base class with a virtual function and then deriving other classes from that base class to implement the needed function.
My problem starts here:
1. Can my project have more than one .cu file?
2. Can .cu files have declarations of classes?
class foo
{
    int m_name;
    void doSomething();
};
3. This one continues 2. I've heard that device kernels cannot be declared inside classes and have to be written like this:
//header file
__DEVICE__ void kernel(int x, int y)
{
    .....
}

class a : foo
{
    void doSomething();
};
//cu file
void a::doSomething()
{
    kernel<<<1,1>>>(...);
}
Is this the right way?
4. Last question: if I use Thrust, must I use .cu files as well?
Thanks, igal
Yes, you can use multiple .cu files in your project.
Yes, but there are restrictions. According to the CUDA C Programming Guide v4.0, section 3.1.5:
The front end of the compiler processes CUDA source files according to C++ syntax rules. Full C++ is supported for the host code. However, only a subset of C++ is fully supported for the device code as described in Appendix D. As a consequence of the use of C++ syntax rules, void pointers (e.g., returned by malloc()) cannot be assigned to non-void pointers without a typecast.
You're ALMOST correct. You have to use the __global__ keyword when declaring your kernel.
__global__ void kernel(int x, int y)
{
    .....
}
Well, yes. Your Thrust-boosted device code should be compiled with nvcc. See the Thrust documentation for details.
In general, you will compile your programs like that:
$ nvcc -c device.cu
$ g++ -c host.cpp -I/usr/local/cuda/include/
$ nvcc device.o host.o
Alternatively, you can use g++ to perform final linking step.
$ g++ -o tester device.o host.o -L/usr/local/cuda/lib64 -lcudart
On Windows change the paths after -I and -L. Also, as far as I know, you have to use cl compiler (MS Visual Studio).
Note 1:
Watch out for x86/x64 compatibility: if you use 64-bit CUDA Toolkit, use also a 64-bit compiler. (check -m32 and -m64 options of nvcc also)
Note 2:
device.cu contains kernels and a function that invokes kernel(s). This function has to be annotated with extern "C".
It can contain classes (limitations apply).
host.cpp contains pure C++ code with an extern "C" declaration of the function that is in device.cu (NOT the kernel).
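For reference, here is a minimal sketch of the layout described above (the file and function names are illustrative, not from the question):
device.cu:
#include <cuda_runtime.h>

__global__ void scale_kernel(float *data, float factor, int n){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// extern "C" wrapper that the pure C++ host code links against.
extern "C" void launch_scale(float *d_data, float factor, int n){
    scale_kernel<<<(n + 255) / 256, 256>>>(d_data, factor, n);
    cudaDeviceSynchronize();
}
host.cpp:
#include <cuda_runtime.h>  // only for cudaMalloc/cudaFree on the host side

extern "C" void launch_scale(float *d_data, float factor, int n);

int main(){
    int n = 1024;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    launch_scale(d_data, 2.0f, n);  // device buffer left unfilled for brevity
    cudaFree(d_data);
    return 0;
}
These two files build exactly as in the nvcc/g++ commands shown above.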