Standard Fortran interface for cuBLAS - cuda

I am using a commercial simulation software on Linux that does intensive matrix manipulation. The software uses Intel MKL by default, but it allows me to replace it with a custom BLAS/LAPACK library. This library must be a shared object (.so) library and must export both BLAS and LAPACK standard routines. The software requires the standard Fortran interface for all of them.
To verify that I can use a custom library, I compiled ATLAS and linked LAPACK (from netlib) inside it. The software was able to use my compiled ATLAS version without any problems.
Now, I want to make the software use cuBLAS in order to enhance the simulation speed. I was confronted by the problem that cuBLAS doesn't export the standard BLAS function names (they have a cublas prefix). Moreover, the library cuBLAS library doesn't include LAPACK routines.
I use readelf -a to check for the exported function.
On another hand, I tried to use MAGMA to solve this problem. I succeeded to compile and link it against all of ATLAS, LAPACK and cuBLAS. But still it doesn't export the correct functions and doesn't include LAPACK in the final shared object. I am not sure if this is the way it is supposed to be or I did something wrong during the build process.
I have also found CULA, but I am not sure if this will solve the problem or not.
Did anybody tried to get cuBLAS/LAPACK (or a proper wrapper) linked into a single (.so) exporting the standard Fortran interface with the correct function names? I believe it is conceptually possible, but I don't know how to do it!

Updated
As indicated by #talonmies, CUDA has provided a fortran thunking wrapper interface.
http://docs.nvidia.com/cuda/cublas/index.html#appendix-b-cublas-fortran-bindings
You should be able to run your application with it. But you probably will not get any performance improvement due to the mem alloc/copy issue described below.
Old
It may not easy. CUBLAS and other CUDA library interfaces assume all the data are already stored in device memory, however in your case, all the data are still in CPU RAM before calling.
You may have to write your own wrapper to deal with it like
void dgemm(...) {
copy_data_from_cpu_ram_to_gpu_mem();
cublas_dgemm(...);
copy_data_from_gpu_mem_to_cpu_ram();
}
On the other hand, you probably have noticed that every single BLAS call requires 2 data copies. This may introduce huge overhead and slow down the overall performance, unless most of your callings are BLAS 3 operations.

Related

Check if GPU is shared

When the GPU is shared with other processes (e.g. Xorg or other CUDA procs), a CUDA process better should not consume all remaining memory but dynamically grow its usage instead.
(There are various errors you might get indirectly from this, like Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR. But this question is not about that.)
(In TensorFlow, you would use allow_growth=True in the GPU options to accomplish this. But this question is not about that.)
Is there a simple way to check if the GPU is currently used by other processes? (I'm not asking whether it is configured to be used for exclusive access.)
I could parse the output nvidia-smi and look for other processes. But that seems somewhat hacky and maybe not so reliable, and not simple enough.
(My software is using TensorFlow, so if TensorFlow provides such a function, nice. But if not, I don't care if this would be a C API or Python function. I would prefer to avoid other external dependencies though, except those I'm anyway using, like CUDA itself, or TensorFlow. I'm not afraid to use ctypes. So consider this question language invariant.)
There is nvmlDeviceGetComputeRunningProcesses and nvmlDeviceGetGraphicsRunningProcesses. (Documentation.)
This is a C API, but I could use pynvml if I don't care about the extra dependency.
Example usage (via).

How to disable or remove numba and cuda from python project?

i've cloned a "PointPillars" repo for 3D detection using just point cloud as input. But when I came to run it, I noted it use cuda and numba. With any prior knowledge about these two, I'm asking if there is any way to remove or disable numba and cuda. I want to run it on local server with CPU only, so I want your advice to solve.
The actual code matters here.
If the usage is only of vectorize or guvectorize using the target=cuda parameter, then "removal" of CUDA should be trivial. Just remove the target parameter.
However if there is use of the #cuda.jit decorator, or explicit copying of data between host and device, then other code refactoring would be involved. There is no simple answer here in that case, the code would have to be converted to an alternate serial or parallel realization via refactoring or porting.

NVRTC and __device__ functions

I am trying to optimize my simulator by leveraging run-time compilation. My code is pretty long and complex, but I identified a specific __device__ function whose performances can be strongly improved by removing all global memory accesses.
Does CUDA allow the dynamic compilation and linking of a single __device__ function (not a __global__), in order to "override" an existing function?
I am pretty sure the really short answer is no.
Although CUDA has dynamic/JIT device linker support, it is important to remember that the linkage process itself is still static.
So you can't delay load a particular function in an existing compiled GPU payload at runtime as you can in a conventional dynamic link loading environment. And the linker still requires that a single instance of all code objects and symbols be present at link time, whether that is a priori or at runtime. So you would be free to JIT link together precompiled objects with different versions of the same code, as long as a single instance of everything is present when the session is finalised and the code is loaded into the context. But that is as far as you can go.
It looks like you have a "main" kernel with a part that is "switchable" at run time.
You can definitely do this using nvrtc. You'd need to go about doing something like this:
Instead of compiling the main kernel ahead of time, store it as as string to be compiled and linked at runtime.
Let's say the main kernel calls "myFunc" which is a device kernel that is chosen at runtime.
You can generate the appropriate "myFunc" kernel based on equations at run time.
Now you can create an nvrtc program using multiple sources using nvrtcCreateProgram.
That's about it. The key is to delay compiling the main kernel until you need it at run time. You may also want to cache your kernels somehow so you end up compiling only once.
There is one problem I foresee. nvrtc may not find the curand device calls which may cause some issues. One work around would be to look at the header the device function call is in and use nvcc to compile the appropriate device kernel to ptx. You can store the resulting ptx as text and use cuLinkAddData to link with your module. You can find more information in this section.

Slatec + CUDA Fortran

I have code written in old-style Fortran 95 for combustion modelling. One of the features of this problem is that one have to solve stiff ODE system for taking into account chemical reactions influence. For this purpouse I use Fortran SLATEC library, which is also quite old. The solving procedure is straight forward, one just need to call subroutine ddriv3 in every cell of computational domain, so that looks something like that:
do i = 1,Number_of_cells ! Number of cells is about 2000
call ddriv3(...) ! All calls are independent on cell number i
end do
ddriv3 is quite complex and utilizes many other library functions.
Is there any way to get an advantage with CUDA Fortran, without searching some another library for this purpose? If I just run this as "parallel loop" is that will be efficient, or may be there is another way?
I'm sorry for such kind of question that immidiately arises the most obvious answer: "Why wouldn't you try and know it by yourself?", but i'm in a really straitened time conditions. I have no any experience in CUDA and I just want to choose the most right and easiest way to start.
Thanks in advance !
You won't be able to use or parallelize the ddriv3 call without some effort. Your usage of the phrase "parallel loop" suggests to me you may be thinking of using OpenACC directives with Fortran, as opposed to CUDA Fortran, but the general answer isn't any different in either case.
The ddriv3 call, being part of a Fortran library (which is presumably compiled for x86 usage) cannot be directly used in either CUDA Fortran (i.e. using CUDA GPU kernels within Fortran) or in OpenACC Fortran, for essentially the same reason: The library code is x86 code and cannot be used on the GPU.
Since presumably you may have access to the source implementation of ddriv3, you might be able to extract the source code, and work on creating a CUDA version of it (or a version that OpenACC won't choke on), but if it uses many other library routines, it may mean that you have to create CUDA (or direct Fortran source, for OpenACC) versions of each of those library calls as well. If you have no experience with CUDA, this might not be what you want to do (I don't know.) If you go down this path, it would certainly imply learning more about CUDA, or at least converting the library calls to direct Fortran source (for an OpenACC version).
For the above reasons, it might make sense to investigate whether a GPU library replacement (or something similar) might exist for the ddriv3 call (but you specifically excluded that option in your question.) There are certainly GPU libraries that can assist in solving ODE's.

Using STL containers in GNU Assembler

Is it possible to "link" the STL to an assembly program, e.g. similar to linking the glibc to use functions like strlen, etc.? Specifically, I want to write an assembly function which takes as an argument a std::vector and will be part of a lib. If this is possible, is there any documentation on this?
Any use of C++ templates will require the compiler to generate instantiations of those templates. So you don't really "link" something like the STL into a program; the compiler generates object code based upon your use of templates in the library.
However, if you can write some C++ code that forces the templates to be instantiated for whatever types and other arguments you need to use, then write some C-linkage functions to wrap the uses of those template instantiations, then you should be able to call those from your assembly code.
I strongly believe you're doing it wrong. Using assembler is not going to speed up your handling of the data. If you must use existing assembly code, simply pass raw buffers
std::vector is by definition (in the standard) compatible with raw buffers (arrays); the standard mandates contiguous allocation. Only reallocation can invalidate the memory region that contains the element data. In short, if the C++ code can know the (max) capacity required and reserve()/resize() appropriately, you can pass &vector[0] as the buffer address and be perfectly happy.
If the assembly code needs to decide how (much) to reallocate, let it use malloc. Once done, you should be able to use that array as STL container:
std::accumulate(buf, buf+n, 0, &dosomething);
Alternatively, you can use the fact that std::tr1::array<T, n> or boost::array<T, n> are POD, and use placement new right on the buffer allocated in the library (see here: placement new + array +alignment or How to make tr1::array allocate aligned memory?)
Side note
I have the suspicion that you are using assembly for the wrong reasons. Optimizing compilers will leverage the full potential of modern processors (including SIMD such as SSE1-4);
E.g. for gcc have a look at
__attibute__ (e.g. for pointer restrictions
such as alignment and aliasing guarantees: this will enable the more powerful vectorization options for the compiler);
-ftree_vectorize and -ftree_vectorizer_verbose=2, -march=native
Note also that since the compiler can't be sure what registers an external (or even inline) assembly procedure clobbers, it must assume all registers are clobbered leading to potential performance degradation. See http://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html for ways to use inline assembly with proper hints to gcc.
probably completely off-topic: -fopenmp and gnu::parallel
Bonus: the following references on (premature) optimization in assembly and c++ might come in handy:
Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms
Optimizing subroutines in assembly language: An optimization guide for x86 platforms
The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers
And some other relevant resources