I'm building a project which uses both Thrust (the CUDA API) and OpenMP. The main purpose of my program is to present an interface to calculate something, in parallel.
In order to do that I've decided to use the STRATEGY design pattern, which basically means that we need to define a base class with a virtual function, and then other classes that derive from that base class and implement the needed function.
My problem starts here:
1. Can my project have more than one .cu file?
2. Can .cu files have declarations of classes? For example:
class foo
{
    int m_name;
    void doSomething();
};
3. This one continues 2.: I've heard that device kernels cannot be declared inside classes and have to be done like this:
// header file
__DEVICE__ void kernel(int x, int y)
{
    ...
}
class a : foo
{
    void doSomething();
};
// .cu file
void a::doSomething()
{
    kernel<<<1,1>>>(...);
}
Is this the right way?
4. Last question: when I use Thrust, must I use .cu files as well?
Thanks, Igal
Yes, you can use multiple .cu files in your project.
Yes, but there are restrictions. According to *CUDA_C_Programming_Guide* v.4.0, section 3.1.5:
The front end of the compiler processes CUDA source files according to C++ syntax rules. Full C++ is supported for the host code. However, only a subset of C++ is fully supported for the device code as described in Appendix D. As a consequence of the use of C++ syntax rules, void pointers (e.g., returned by malloc()) cannot be assigned to non-void pointers without a typecast.
You're ALMOST correct. You have to use the __global__ keyword when declaring your kernel:
__global__ void kernel(int x, int y)
{
    ...
}
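To make the distinction concrete, here is a small sketch (my addition, not part of the original answer): a __device__ function is a helper callable only from device code, while a __global__ function is a kernel entry point that the host launches with the <<<...>>> syntax.
__device__ int square(int v)
{
    return v * v;
}

__global__ void kernel(int x, int y)
{
    int s = square(x + y); // device-to-device call is fine
    (void)s;
}

void launch()
{
    kernel<<<1, 1>>>(2, 3); // only __global__ functions can be launched
    cudaDeviceSynchronize();
}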
Well, yes. Actually, your Thrust-boosted device code should be compiled with nvcc. See the Thrust documentation for details.
In general, you will compile your programs like this:
$ nvcc -c device.cu
$ g++ -c host.cpp -I/usr/local/cuda/include/
$ nvcc device.o host.o
Alternatively, you can use g++ to perform the final linking step:
$ g++ -o tester device.o host.o -L/usr/local/cuda/lib64 -lcudart
On Windows, change the paths after -I and -L. Also, as far as I know, you have to use the cl compiler (MS Visual Studio).
Note 1:
Watch out for x86/x64 compatibility: if you use a 64-bit CUDA Toolkit, also use a 64-bit compiler. (Check the -m32 and -m64 options of nvcc as well.)
Note 2:
device.cu contains kernels and a function that invokes the kernel(s). This function has to be annotated with extern "C".
It can contain classes (limitations apply).
host.cpp contains pure C++ code with an extern "C" declaration of the function that is in device.cu (NOT the kernel).
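A minimal sketch of that split (my illustration; the name launch_kernel is made up):
// device.cu -- compiled with nvcc
__global__ void kernel(int x, int y)
{
    // ... device code ...
}

extern "C" void launch_kernel(int x, int y) // host wrapper, NOT the kernel
{
    kernel<<<1, 1>>>(x, y);
    cudaDeviceSynchronize();
}

// host.cpp -- compiled with g++, no CUDA-specific code needed
extern "C" void launch_kernel(int x, int y);

int main()
{
    launch_kernel(2, 3);
    return 0;
}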
Related
So I'm trying to get started on GPU programming, using the Thrust library to simplify things.
I have created a test program to work with it and see how it works; however, whenever I try to create a thrust::device_vector with non-zero size, the program crashes with "Run-Time Check Failure #3 - The variable 'result' is being used without being initialized." (this comes from the allocator_traits.inl file). And... I have no idea how to fix this.
The following is all that is needed to cause this error.
#include <thrust/device_vector.h>

int main()
{
    int N = 100;
    thrust::device_vector<int> d_a(N);
    return 0;
}
I suspect it may be a problem with how the environment is set up, so the details on that are:
Created using Visual Studio 2019, in a CUDA 11.0 Runtime project (the example program given when opening this project works fine, however), Thrust version 1.9, and the GPU is a GTX 970.
This issue only seems to manifest with the Thrust version (1.9.x) associated with CUDA 11.0, and only in debug projects on Windows/Visual Studio.
Some workarounds would be to switch to building a release project, or to just click "Ignore" on the dialogs that appear at runtime. According to my testing, this allows ordinary run or debug at that point.
I have not confirmed it, but I believe this issue is fixed in the latest thrust (1.10.x) just released (although not part of any formal CUDA release at this moment, I would expect it to be part of some future CUDA release).
Following the answer of Robert Crovella, I fixed this issue by replacing the corresponding lines of code in the Thrust library with the code from GitHub. More precisely, in the file ...\CUDA\v11.1\include\thrust\detail\allocator\allocator_traits.inl I replaced the following function
template<typename Alloc>
__host__ __device__
typename disable_if<
    has_member_system<Alloc>::value,
    typename allocator_system<Alloc>::type
>::type
system(Alloc &)
{
    // return a copy of a default-constructed system
    typename allocator_system<Alloc>::type result;
    return result;
}
by
template<typename Alloc>
__host__ __device__
typename disable_if<
    has_member_system<Alloc>::value,
    typename allocator_system<Alloc>::type
>::type
system(Alloc &)
{
    // return a copy of a default-constructed system
    return typename allocator_system<Alloc>::type();
}
I have a function in my program called float valueAt(float3 v). It's supposed to return the value of a function at the given point. The function is user-specified. I have an interpreter for this function at the moment, but others have recommended I compile the function online (at runtime) so it runs as machine code and is faster.
How do I do this? I believe I know how to load the function when I have PTX generated, but I have no idea how to generate the PTX.
CUDA provides no way of runtime compilation of non-PTX code.
What you want can be done, but not using the standard CUDA APIs. PyCUDA provides an elegant just-in-time compilation method for CUDA C code which includes behind the scenes forking of the toolchain to compile to device code and loading using the runtime API. The (possible) downside is that you need to use Python for the top level of your application, and if you are shipping code to third parties, you might need to ship a working Python distribution too.
The only other alternative I can think of is OpenCL, which does support runtime compilation (that is all it supported until recently). The C99 language base is a lot more restrictive than what CUDA offers, and I find the APIs to be very verbose, but the runtime compilation model works well.
I've thought about this problem for a while, and while I don't think this is a "great" solution, it does seem to work so I thought I would share it.
The basic idea is to use Linux to spawn processes to compile and then run the compiled code. I think this is pretty much a no-brainer, but since I put together the pieces, I'll post instructions here in case it's useful for somebody else.
The problem statement in the question is to be able to take a file that contains a user-defined function, let's assume it is a function of a single variable f(x), i.e. y = f(x), and that x and y can be represented by float quantities.
The user would edit a file called fx.txt that contains the desired function. This file must conform to C syntax rules.
fx.txt:
y=1/x
This file then gets included in the __device__ function that will be holding it:
user_testfunc.cuh:
__device__ float fx(float x){
    float y;
#include "fx.txt"
    ;
    return y;
}
which gets included in the kernel that is called via a wrapper.
cudalib.cu:
#include <math.h>
#include "cudalib.h"
#include "user_testfunc.cuh"
__global__ void my_kernel(float x, float *y){
    *y = fx(x);
}

float cudalib_compute_fx(float x){
    float *d, *h_d;
    h_d = (float *)malloc(sizeof(float));
    cudaMalloc(&d, sizeof(float));
    my_kernel<<<1,1>>>(x, d);
    cudaMemcpy(h_d, d, sizeof(float), cudaMemcpyDeviceToHost);
    return *h_d;
}
cudalib.h:
float cudalib_compute_fx(float x);
The above files get built into a shared library:
nvcc -arch=sm_20 -Xcompiler -fPIC -shared cudalib.cu -o libmycudalib.so
We need a main application to use this shared library.
t452.cu:
#include <stdio.h>
#include <stdlib.h>
#include "cudalib.h"
int main(int argc, char* argv[]){
    if (argc == 1){
        // recompile lib, and spawn new process
        int retval = system("nvcc -arch=sm_20 -Xcompiler -fPIC -shared cudalib.cu -o libmycudalib.so");
        char scmd[128];
        sprintf(scmd, "%s skip", argv[0]);
        retval = system(scmd);
    }
    else { // compute f(x) at x = 2.0
        printf("Result is: %f\n", cudalib_compute_fx(2.0));
    }
    return 0;
}
Which is compiled like this:
nvcc -arch=sm_20 -o t452 t452.cu -L. -lmycudalib
At this point, the main application (t452) can be executed and it will produce the result of f(2.0) which is 0.5 in this case:
$ LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./t452
Result is: 0.500000
The user can then modify the fx.txt file:
$ vi fx.txt
$ cat fx.txt
y = 5/x
And just re-run the app, and the new functional behavior is used:
$ LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH ./t452
Result is: 2.500000
This method takes advantage of the fact that upon recompilation/replacement of a shared library, a new linux process will pick up the new shared library. Also note that I've omitted several kinds of error checking for clarity. At a minimum I would check CUDA errors, and I would also probably delete the shared object (.so) library before recompiling it, and then test for its existence after compilation, to do a basic test that the compilation proceeded successfully.
This method entirely uses the runtime API to achieve this goal, so as a result the user would have to have the CUDA toolkit installed on their machine and appropriately set up so that nvcc is available in the PATH. Using the driver API with PTX code would make this process much cleaner (and not require the toolkit on the user's machine), but AFAIK there is no way to generate PTX from CUDA C without using nvcc or a user-created toolchain built on the nvidia llvm compiler tools. In the future, there may be a more "integrated" approach available in the "standard" CUDA C toolchain, or perhaps even by the driver.
A similar approach can be arranged using separate compilation and linking of device code, such that the only source code that needs to be exposed to the user is in user_testfunc.cu (and fx.txt).
EDIT: There is now a CUDA runtime compilation facility (NVRTC), which should be used in place of the above.
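For reference, here is a minimal NVRTC sketch (my addition, with error checking omitted): it compiles a CUDA source string to PTX at runtime, and the resulting PTX can then be loaded with the driver API (cuModuleLoadDataEx / cuModuleGetFunction), which answers the original question without shelling out to nvcc.
#include <nvrtc.h>
#include <cstdio>
#include <vector>

int main()
{
    const char *src =
        "extern \"C\" __global__ void my_kernel(float x, float *y){\n"
        "    *y = 1.0f / x;\n"
        "}\n";

    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "fx.cu", 0, NULL, NULL);
    nvrtcCompileProgram(prog, 0, NULL); // on failure, inspect nvrtcGetProgramLog

    size_t ptxSize;
    nvrtcGetPTXSize(prog, &ptxSize);
    std::vector<char> ptx(ptxSize);
    nvrtcGetPTX(prog, ptx.data());
    nvrtcDestroyProgram(&prog);

    printf("%s", ptx.data()); // PTX text, ready for cuModuleLoadDataEx
    return 0;
}
Built with something like: g++ demo.cpp -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lnvrtc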
I have read related tutorials regarding shared and static libraries, such as:
"Creating a shared and static library with the gnu compiler [gcc]"
"Static, Shared Dynamic and Loadable Linux Libraries"
However, unfortunately, all the examples they use are one function per .c file.
I have two questions:
(1) If I have one file with two or more functions, such as example1.c:
void ctest11(int *i)
{ *i = 5; }
void ctest12(int *i)
{ *i = 5; }
After compiling example1.c to libexample1.so, can I call both ctest11 and ctest12 in it?
(2) If I have one file with two or more functions, one of them being a main function, such as example2.c:
void ctest21(int *i)
{ *i = 5; }
void main(int *i)
{ *i = 5; }
After compiling example2.c to libexample2.so, is it the same as compiling a .c file with only the ctest21 function?
(3) If I have two files, example3.c and example4.c.
The function in example3.c will use the function in example4.c.
For example:
example3.c
void ctest31(int *i)
{ *i = ctest41(2,3); }
example4.c
int ctest41(int a, int b)
{ return a+b; }
When I compile example3.c and example4.c to libexample34.so, can I call both ctest31 and ctest41?
But if I gcc example3.c with a precompiled example4.o into libexample3.so, I guess I can only call ctest31?
You should look inside some existing free software library: build it, and study its code and building process.
In general, a shared object can be made from several C source files src1sh.c, src2sh.c, .... Very often, the compilation is driven by a builder program, usually GNU make.
First, you need to compile every source file of the shared object as position-independent code (PIC), e.g.
gcc -Wall -fPIC src1sh.c -c -o src1sh.pic.o
gcc -Wall -fPIC src2sh.c -c -o src2sh.pic.o
You probably want to add -g to the gcc flags for debugging purposes. Once your program and shared objects are bug-free because you have debugged them with gdb and valgrind, pass -O2 to gcc to have them optimized.
Then you need to link all these PIC object files into a single shared object (a *.so file), like
gcc -shared src1sh.pic.o src2sh.pic.o -o shared.so
If your intent is to make a shared library, call it lib*.so, e.g. libfoo.so, and refer to it with the -lfoo flag in the gcc command that links against your shared library.
Notice that linking a shared object may also link other shared libraries, so you could do
gcc -shared src1sh.pic.o src2sh.pic.o -lsome -o shared.so
to link some libsome.so into your shared.so
You usually don't compile a shared object containing a main (remember that main is a very special function, described specifically in the C standard, and called from the startup code crt*.o linked by gcc into every program); this is nearly nonsense (like your libexample2.so). Your main is defined in your program (and you don't need PIC code for your program executable). If your program is made from source files src1pr.c and src2pr.c (which defines main), you first compile them as
gcc -Wall src1pr.c -c -o src1pr.o
gcc -Wall src2pr.c -c -o src2pr.o
and you link them all with e.g.
gcc src1pr.o src2pr.o -o prog -lshared
where -lshared refers to a shared library libshared.so (you probably want to compile and link your program files with -g for debugging information, and you may want to pass additional -I flags for include directories, and -L flags for library directories, e.g. -L. to search for libraries in the current directory ...)
There is a way to dynamically link some shared object at runtime, notably for having plugins. You then want to use the dlopen & dlsym functions for that (and you usually want to link your main program with the -rdynamic flag).
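For example, a minimal dlopen/dlsym sketch (my illustration; it assumes a libplugin.so that exports int compute(int)):
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    void *handle = dlopen("./libplugin.so", RTLD_NOW);
    if (!handle) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    /* look up the symbol by name and cast it to the right function type */
    int (*compute)(int) = (int (*)(int)) dlsym(handle, "compute");
    if (!compute) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    printf("compute(42) = %d\n", compute(42));
    dlclose(handle);
    return 0;
}
Build it with gcc main.c -o main -ldl.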
You can call (from your program) any visible function inside a shared object. You may want to play with the visibility function attribute to e.g. restrict the visibility of some function inside your shared object. You might perhaps want to later use the constructor attribute, for a function inside a shared object to be called early at initialization time (if it is a plugin, at its dlopen time).
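For instance (a hypothetical sketch of those two attributes):
/* hidden: usable inside the shared object, but not exported from it */
__attribute__((visibility("hidden")))
int helper(int x)
{
    return x + 1;
}

/* constructor: runs when the shared object is loaded (e.g. at dlopen time) */
__attribute__((constructor))
static void on_load(void)
{
    /* initialization code goes here */
}

/* default visibility: this function is exported */
int api_entry(int x)
{
    return helper(x);
}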
Read the Program Library HOWTO and Levine's "Linkers and Loaders" book for more. Linux shared objects (and relocatable objects *.o, and executable binaries) are in the Executable and Linkable Format (ELF is an industry standard). Some further details are described in the Application Binary Interface (and notably the ABI supplement for your processor, e.g. the AMD64 ABI supplement).
PS. You really want a builder like GNU make to combine all these steps, so read its documentation. You might want to pass -v to gcc to understand what it is doing...
Thanks for Basile's great explanation.
From what I understand, related to my questions:
(a) For my first question (1): there are multiple functions in one object file, and I can call both ctest11 and ctest12 in libexample1.so.
I may set visibility on the functions in libexample1.so.
(b) For my third question (3): the first scenario is about creating a library from two object files; I can call any function in those files.
The second scenario is about creating a library and linking it with another library; I can call any function in the libraries, including the linked library.
(c) I still do not understand the situation with a main function.
You said, "You usually don't compile a shared object containing a main; this is nearly nonsense (like your libexample2.so)."
I know it is nonsense. But suppose I do not want to change the program file and want to compile it into a library anyway: say, I compile example2.c to libexample2.so and want to call the function ctest21.
Can I do that?
example2.c
void ctest21(int *i)
{ *i = 5; }
void main(int *i)
{ *i = 5; }
I compile it to a library.
gcc -fPIC -g -c -Wall example2.c
gcc -shared -o libexample2.so example2.o
I think I can call the ctest21 function in libexample2.so. But the main function is useless.
Is my understanding correct?
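For what it's worth, here is a minimal caller sketch (my hypothetical caller.c, assuming the libexample2.so built above): the executable provides its own main, so the library's main is simply never used.
/* caller.c -- build with: gcc caller.c -L. -lexample2 -o caller */
#include <stdio.h>

void ctest21(int *i); /* hand-written declaration of the library's function */

int main(void)
{
    int v = 0;
    ctest21(&v);
    printf("v = %d\n", v); /* prints v = 5 */
    return 0;
}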
Since I made some progress, I changed the title and made a second edit describing my new problem. You may choose to ignore Edit1.
I have been trying to run python code from C code. And for this purpose I have been using Cython.
The semantics of my system are such that there is a binary (whose source I cannot access) that calls a C function defined in a file (whose source is accessible), and within this function I need to call Python functions, do some processing, and return the result to the binary.
To achieve this purpose, there are two approaches that I came across:
http://docs.python.org/release/2.5.2/ext/callingPython.html ===> This approach suggests having the Python callback function passed to the C side, so that the callback is called as necessary. But this doesn't work for me, as I don't have access to the binary's source (which is used to run the entire system).
https://stackoverflow.com/a/5721123/1126425 ==> I have tried this approach and I get this error when the cython function is called:
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xb47deb70 (LWP 2065)]
0x007fd38a in PySys_GetObject () from /usr/lib/libpython2.6.so.1.0
http://www.linuxjournal.com/article/8497?page=0,0 ===> This is in fact the basis for Cython's functionality, but again, when I use the examples described there, I get errors similar to 2.
I have no idea how to resolve these errors. Any help would be much appreciated.
Thanks!!
Edit1:
here is a simple scenario that reflects situation:
external.c
#include <external.h>
int callback(int param1, int param2) // Function that the binary calls
{
    /* SomeTasks */
    cython_function(); // Function defined in the following .pyx file
    /* SomeTasks */
}
cython_file.pyx
cdef void cython_function():
    print "Do Nothing!"
I am linking the shared library file created by Cython with the library generated by compiling the above C code, and then that library is used by the binary...
Edit2:
The segmentation fault went away when I added Py_Initialize(); before calling cython_function(). But now I am getting an undefined symbol error: symbol lookup error: lib_c_code.so: undefined symbol: cython_function
Here lib_c_code.so is the shared library created out of the external.c file above. I have tried including the .h file created by the Cython compiler in external.c, but it still didn't work out. Here is how I am compiling lib_c_code.so:
gcc -shared -dynlib -lm -W1 -o lib_c_code.so $(OBJDIR)/*.o -lc -lm -lpy_code
and the libpy_code.so is the shared object file that was created out of the cython_file.pyx file as:
cython cython_file.pyx -o cython_file.c
gcc $(IFLAGS) -I/usr/include/python2.6 -fPIC -shared cython_file.c -lpython2.6 -lm -o libpy_code.so
Also, I can see the symbol cython_function in the lib_c_code.so file when I do nm -g lib_c_code.so.
Any ideas, please?
I have to guess here that there's a callback registration function to which you can pass the function pointer. In that case you can simply forgo the C file, define a cdef function directly in your Cython code, and pass that to the callback registration function. Use with gil in case you manipulate any Python objects in it.
cdef extern from "external.h":
    ctypedef int (*Cb_Func)(int param1, int param2)
    void register_callback(Cb_Func func)

cdef int my_callback(int param1, int param2) with gil:
    <implementation>

register_callback(my_callback)
This is also explained in the Cython user manual here: http://docs.cython.org/src/userguide/external_C_code.html
I am developing a CUDA 4.0 application running on a Fermi card. According to the specs, Fermi has Compute Capability 2.0 and therefore should support non-inlined function calls.
I compile every class I have with nvcc 4.0 into a distinct .obj file. Then, I link them all with g++ 4.4.
Consider the following code :
[File A.cuh]
#include <cuda_runtime.h>
struct A
{
    __device__ __host__ void functionA();
};
[File B.cuh]
#include <cuda_runtime.h>
struct B
{
    __device__ __host__ void functionB();
};
[File A.cu]
#include "A.cuh"
#include "B.cuh"
void A::functionA()
{
    B b;
    b.functionB();
}
Attempting to compile A.cu with nvcc -o A.o -c A.cu -arch=sm_20 outputs Error: External calls are not supported (found non-inlined call to _ZN1B9functionBEv).
I must be doing something wrong, but what ?
As explained in this thread on the NVIDIA forums, it appears that even though Fermi supports non-inlined functions, nvcc still needs to have all the functions available during compilation, i.e. in the same source file: there is no device-code linker (yep, that's a pity...).
functionB is not defined in this translation unit and is therefore considered an external call. As the error says, external calls are not supported. Implement functionB in the same file and it will work.
True, CUDA 5.0 does it. I can't get it to expose external device variables, but device methods work just fine. It's not enabled by default, though.
The nvcc option is "-rdc=true". In Visual Studio and Nsight it is an option in the project properties under Configuration Properties -> CUDA C/C++ -> Common -> Generate Relocatable Device Code.
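On the command line, an equivalent separate-compilation build might look like this (a sketch; B.cu is assumed to define B::functionB from the question):
$ nvcc -arch=sm_20 -rdc=true -c A.cu -o A.o
$ nvcc -arch=sm_20 -rdc=true -c B.cu -o B.o
$ nvcc -arch=sm_20 A.o B.o -o app
nvcc performs the device-link step itself in the last command. If you want to link with g++ instead (as in the question), add an explicit device-link step first: nvcc -arch=sm_20 -dlink A.o B.o -o dlink.o, then g++ A.o B.o dlink.o -L/usr/local/cuda/lib64 -lcudart.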