Slatec + CUDA Fortran - cuda

I have code written in old-style Fortran 95 for combustion modelling. One of the features of this problem is that one have to solve stiff ODE system for taking into account chemical reactions influence. For this purpouse I use Fortran SLATEC library, which is also quite old. The solving procedure is straight forward, one just need to call subroutine ddriv3 in every cell of computational domain, so that looks something like that:
do i = 1,Number_of_cells ! Number of cells is about 2000
call ddriv3(...) ! All calls are independent on cell number i
end do
ddriv3 is quite complex and utilizes many other library functions.
Is there any way to get an advantage with CUDA Fortran, without searching some another library for this purpose? If I just run this as "parallel loop" is that will be efficient, or may be there is another way?
I'm sorry for such kind of question that immidiately arises the most obvious answer: "Why wouldn't you try and know it by yourself?", but i'm in a really straitened time conditions. I have no any experience in CUDA and I just want to choose the most right and easiest way to start.
Thanks in advance !

You won't be able to use or parallelize the ddriv3 call without some effort. Your usage of the phrase "parallel loop" suggests to me you may be thinking of using OpenACC directives with Fortran, as opposed to CUDA Fortran, but the general answer isn't any different in either case.
The ddriv3 call, being part of a Fortran library (which is presumably compiled for x86 usage) cannot be directly used in either CUDA Fortran (i.e. using CUDA GPU kernels within Fortran) or in OpenACC Fortran, for essentially the same reason: The library code is x86 code and cannot be used on the GPU.
Since presumably you may have access to the source implementation of ddriv3, you might be able to extract the source code, and work on creating a CUDA version of it (or a version that OpenACC won't choke on), but if it uses many other library routines, it may mean that you have to create CUDA (or direct Fortran source, for OpenACC) versions of each of those library calls as well. If you have no experience with CUDA, this might not be what you want to do (I don't know.) If you go down this path, it would certainly imply learning more about CUDA, or at least converting the library calls to direct Fortran source (for an OpenACC version).
For the above reasons, it might make sense to investigate whether a GPU library replacement (or something similar) might exist for the ddriv3 call (but you specifically excluded that option in your question.) There are certainly GPU libraries that can assist in solving ODE's.

Related

Normal Cuda Vs CuBLAS?

Just of curiosity. CuBLAS is a library for basic matrix computations. But these computations, in general, can also be written in normal Cuda code easily, without using CuBLAS. So what is the major difference between the CuBLAS library and your own Cuda program for the matrix computations?
We highly recommend developers use cuBLAS (or cuFFT, cuRAND, cuSPARSE, thrust, NPP) when suitable for many reasons:
We validate correctness across every supported hardware platform, including those which we know are coming up but which maybe haven't been released yet. For complex routines, it is entirely possible to have bugs which show up on one architecture (or even one chip) but not on others. This can even happen with changes to the compiler, the runtime, etc.
We test our libraries for performance regressions across the same wide range of platforms.
We can fix bugs in our code if you find them. Hard for us to do this with your code :)
We are always looking for which reusable and useful bits of functionality can be pulled into a library - this saves you a ton of development time, and makes your code easier to read by coding to a higher level API.
Honestly, at this point, I can probably count on one hand the number of developers out there who actually implement their own dense linear algebra routines rather than calling cuBLAS. It's a good exercise when you're learning CUDA, but for production code it's usually best to use a library.
(Disclosure: I run the CUDA Library team)
There's several reasons you'd chose to use a library instead of writing your own implementation. Three, off the top of my head:
You don't have to write it. Why do work when somebody else has done it for you?
It will be optimised. NVIDIA supported libraries such as cuBLAS are likely to be optimised for all current GPU generations, and later releases will be optimised for later generations. While most BLAS operations may seem fairly simple to implement, to get peak performance you have to optimise for hardware (this is not unique to GPUs). A simple implementation of SGEMM, for example, may be many times slower than an optimised version.
They tend to work. There's probably less chance you'll run up against a bug in a library then you'll create a bug in your own implementation which bites you when you change some parameter or other in the future.
The above isn't just relevent to cuBLAS: if you have a method that's in a well supported library you'll probably save a lot of time and gain a lot of performance using it relative to using your own implementation.

Standard Fortran interface for cuBLAS

I am using a commercial simulation software on Linux that does intensive matrix manipulation. The software uses Intel MKL by default, but it allows me to replace it with a custom BLAS/LAPACK library. This library must be a shared object (.so) library and must export both BLAS and LAPACK standard routines. The software requires the standard Fortran interface for all of them.
To verify that I can use a custom library, I compiled ATLAS and linked LAPACK (from netlib) inside it. The software was able to use my compiled ATLAS version without any problems.
Now, I want to make the software use cuBLAS in order to enhance the simulation speed. I was confronted by the problem that cuBLAS doesn't export the standard BLAS function names (they have a cublas prefix). Moreover, the library cuBLAS library doesn't include LAPACK routines.
I use readelf -a to check for the exported function.
On another hand, I tried to use MAGMA to solve this problem. I succeeded to compile and link it against all of ATLAS, LAPACK and cuBLAS. But still it doesn't export the correct functions and doesn't include LAPACK in the final shared object. I am not sure if this is the way it is supposed to be or I did something wrong during the build process.
I have also found CULA, but I am not sure if this will solve the problem or not.
Did anybody tried to get cuBLAS/LAPACK (or a proper wrapper) linked into a single (.so) exporting the standard Fortran interface with the correct function names? I believe it is conceptually possible, but I don't know how to do it!
Updated
As indicated by #talonmies, CUDA has provided a fortran thunking wrapper interface.
http://docs.nvidia.com/cuda/cublas/index.html#appendix-b-cublas-fortran-bindings
You should be able to run your application with it. But you probably will not get any performance improvement due to the mem alloc/copy issue described below.
Old
It may not easy. CUBLAS and other CUDA library interfaces assume all the data are already stored in device memory, however in your case, all the data are still in CPU RAM before calling.
You may have to write your own wrapper to deal with it like
void dgemm(...) {
copy_data_from_cpu_ram_to_gpu_mem();
cublas_dgemm(...);
copy_data_from_gpu_mem_to_cpu_ram();
}
On the other hand, you probably have noticed that every single BLAS call requires 2 data copies. This may introduce huge overhead and slow down the overall performance, unless most of your callings are BLAS 3 operations.

CUDA: calling library function in kernel

I know that there is the restriction to call only __device__ functions in the kernel. This prevents me from calling standard functions like strcmp() and so on in the kernel.
At this point I am not able to understand/find the reasons for this. Could not the compiler just follow each includes in strings.h and so on while inlining the calls to strcmp() in the kernel? I guess the reason I am looking for is easy and I am missing something here.
Is it the only way to reimplement all the functions and datatypes I need in kernel computation? Is there a codebase with such reimplementations?
Yes, the only way to use stdlib's functions from kernel is to reimplement them. But I strongly advice you to reconsider this idea, since it's highly unlikely you would need to run code that uses strcmp() on GPU. Please, add additional details about your problem, so a better solution could be proposed (I highly doubt that serial string comparison on GPU is what you really need).
It's barely possible to simply recompile all stdlib for GPU, since it depends a lot on some system calls (like memory allocation), which could not be used on GPU (well, in recent versions of CUDA toolkit you can allocate device memory from kernel, but it's not "cuda-way", is supported only by newest hardware and is very bad for performance).
Besides, CPU versions of most functions is far from being "good" for GPUs. So, in vast majority of cases compiling your ordinary CPU functions for GPU would lead to no good, so the compiler doesn't even try it.
Standard functions like strcmp() have not been compiled for the CUDA architecture. I have not seen any standard C libraries for CUDA.

how to use my existing .cpp code with cuda

I hv code in c++ and wanted to use it along with cuda.Can anyone please help me? Should I provide my code?? Actually I tried doing so but I need some starting code to proceed for my code.I know how to do simple square program (using cuda and c++)for windows(visual studio) .Is it sufficient to do the things for my program?
The following are both good places to start. CUDA by Example is a good tutorial that gets you up and running pretty fast. Programming Massively Parallel Processors includes more background, e.g. chapters on the history of GPU architecture, and generally more depth.
CUDA by Example: An Introduction to General-Purpose GPU Programming
Programming Massively Parallel Processors: A Hands-on Approach
These both talk about CUDA 3.x so you'll want to look at the new features in CUDA 4.x at some point.
Thrust is definitely worth a look if your problem maps onto it well (see comment above). It's an STL-like library of containers, iterators and algorithms that implements data-parallel algorithms on top of CUDA.
Here are two tutorials on getting started with CUDA and Visual C++ 2010:
http://www.ademiller.com/blogs/tech/2011/03/using-cuda-and-thrust-with-visual-studio-2010/
http://blog.cuvilib.com/2011/02/24/how-to-run-cuda-in-visual-studio-2010/
There's also a post on the NVIDIA forum:
http://forums.nvidia.com/index.php?showtopic=184539
Asking very general how do I get started on ... on Stack Overflow generally isn't the best approach. Typically the best reply you'll get is "go read a book or the manual". It's much better to ask specific questions here. Please don't create duplicate questions, it isn't helpful.
It's a non-trivial task to convert a program from straight C(++) to CUDA. As far as I know, it is possible to use C++ like stuff within CUDA (esp. with the announced CUDA 4.0), but I think it's easier to start with only C stuff (i.e. structs, pointers, elementary data types).
Start by reading the CUDA programming guide and by examining the examples coming with the CUDA SDK or available here. I personally found the vector addition sample quite enlightening. It can be found over here.
I can not tell you how to write your globals and shareds for your specific program, but after reading the introductory material, you will have at least a vague idea of how to do.
The problem is that it is (as far as I know) not possible to tell a generic way of transforming pure C(++) into code suitable for CUDA. But here are some corner stones for you:
Central idea for CUDA: Loops can be transformed into different threads executed multiple times in parallel on the GPU.
Therefore, the single iterations optimally are independent of other iterations.
For optimal execution, the single execution branches of the threads should be (almost) the same, i.e. the single threads sould do almost the same.
You can have multiple .cpp and .cu files in your project. Unless you want your .cu files to contain only device code, it should be fairly easy.
For your .cu files you specify a header file, containing host functions in it. Then, include that header file in other .cu or .cpp files. The linker will do the rest. It is nothing different than having multiple plain C++ .cpp files in your project.
I assume you already have CUDA rule files for your Visual Studio.

how to use my existing .cpp code to write cuda code [duplicate]

I hv code in c++ and wanted to use it along with cuda.Can anyone please help me? Should I provide my code?? Actually I tried doing so but I need some starting code to proceed for my code.I know how to do simple square program (using cuda and c++)for windows(visual studio) .Is it sufficient to do the things for my program?
The following are both good places to start. CUDA by Example is a good tutorial that gets you up and running pretty fast. Programming Massively Parallel Processors includes more background, e.g. chapters on the history of GPU architecture, and generally more depth.
CUDA by Example: An Introduction to General-Purpose GPU Programming
Programming Massively Parallel Processors: A Hands-on Approach
These both talk about CUDA 3.x so you'll want to look at the new features in CUDA 4.x at some point.
Thrust is definitely worth a look if your problem maps onto it well (see comment above). It's an STL-like library of containers, iterators and algorithms that implements data-parallel algorithms on top of CUDA.
Here are two tutorials on getting started with CUDA and Visual C++ 2010:
http://www.ademiller.com/blogs/tech/2011/03/using-cuda-and-thrust-with-visual-studio-2010/
http://blog.cuvilib.com/2011/02/24/how-to-run-cuda-in-visual-studio-2010/
There's also a post on the NVIDIA forum:
http://forums.nvidia.com/index.php?showtopic=184539
Asking very general how do I get started on ... on Stack Overflow generally isn't the best approach. Typically the best reply you'll get is "go read a book or the manual". It's much better to ask specific questions here. Please don't create duplicate questions, it isn't helpful.
It's a non-trivial task to convert a program from straight C(++) to CUDA. As far as I know, it is possible to use C++ like stuff within CUDA (esp. with the announced CUDA 4.0), but I think it's easier to start with only C stuff (i.e. structs, pointers, elementary data types).
Start by reading the CUDA programming guide and by examining the examples coming with the CUDA SDK or available here. I personally found the vector addition sample quite enlightening. It can be found over here.
I can not tell you how to write your globals and shareds for your specific program, but after reading the introductory material, you will have at least a vague idea of how to do.
The problem is that it is (as far as I know) not possible to tell a generic way of transforming pure C(++) into code suitable for CUDA. But here are some corner stones for you:
Central idea for CUDA: Loops can be transformed into different threads executed multiple times in parallel on the GPU.
Therefore, the single iterations optimally are independent of other iterations.
For optimal execution, the single execution branches of the threads should be (almost) the same, i.e. the single threads sould do almost the same.
You can have multiple .cpp and .cu files in your project. Unless you want your .cu files to contain only device code, it should be fairly easy.
For your .cu files you specify a header file, containing host functions in it. Then, include that header file in other .cu or .cpp files. The linker will do the rest. It is nothing different than having multiple plain C++ .cpp files in your project.
I assume you already have CUDA rule files for your Visual Studio.