Difference between cuda.h, cuda_runtime.h, cuda_runtime_api.h - cuda

I'm starting to program with CUDA, and in some examples I find the include files cuda.h, cuda_runtime.h and cuda_runtime_api.h included in the code. Can someone explain to me the difference between these files?

In very broad terms:
cuda.h defines the public host functions and types for the CUDA driver API.
cuda_runtime_api.h defines the public host functions and types for the CUDA runtime API.
cuda_runtime.h defines everything cuda_runtime_api.h does, as well as built-in type definitions and function overlays for the CUDA language extensions and device intrinsic functions.
If you were writing host code that makes API calls and is to be compiled with the host compiler, you would include either cuda.h or cuda_runtime_api.h. If you needed other CUDA language built-ins, like types, and were using the runtime API and compiling with the host compiler, you would include cuda_runtime.h. If you are writing code which will be compiled using nvcc, it is all irrelevant, because nvcc takes care of including all the required headers automatically without programmer intervention.
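For example, a host-only translation unit like this builds with a plain host compiler against the runtime API (a minimal sketch; you must point the compiler at the CUDA include directory and link cudart yourself, e.g. g++ query.cpp -I/usr/local/cuda/include -lcudart):

#include <cuda_runtime_api.h>
#include <cstdio>

int main() {
    int n = 0;
    // runtime API call, resolved by libcudart at link time
    if (cudaGetDeviceCount(&n) == cudaSuccess)
        std::printf("found %d CUDA device(s)\n", n);
    return 0;
}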

A few observations in addition to @talonmies' answer:
cuda_runtime.h includes cuda_runtime_api.h internally, but not the other way around. So "runtime includes all of runtime_api" is the mnemonic to remember.
cuda_runtime_api.h does not contain all of the runtime API functions you'll find in the official documentation, while cuda_runtime.h has them all (example: cudaEventCreate()). However, all the API calls defined in cuda_runtime.h are actually implemented, in the header file itself, using calls to functions in cuda_runtime_api.h. These are the "function overlays" that @talonmies mentioned.
cuda_runtime_api.h is a C-language header (IIANM) with only C-language function declarations; cuda_runtime.h is a C++ header file, with some templated functions implemented.
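To make the "function overlay" idea concrete, the pattern inside cuda_runtime.h looks roughly like this (paraphrased, not the verbatim header source): a C++ overload implemented inline on top of the C entry point declared in cuda_runtime_api.h.

// C++ convenience overload living in cuda_runtime.h ...
static __inline__ __host__ cudaError_t cudaEventCreate(
        cudaEvent_t *event, unsigned int flags) {
    // ... forwarding to the C-level function declared in cuda_runtime_api.h
    return ::cudaEventCreateWithFlags(event, flags);
}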

Related

Can libcu++ be included in a regular C++ project?

I have tried using <cuda/std/chrono> and <cuda/std/array> inside my C++ project, not compiled with NVCC. The reason I am not making a CUDA project is that NVCC fails to compile some of my template code.
I am getting several errors about undefined constants such as the ones below:
error C2065: '_LInf': undeclared identifier
error C3615: constexpr function 'cuda::std::__4::__libcpp_numeric_limits<long double,true>::infinity' cannot result in a constant expression
Can libcu++ be used inside C++ projects or only CUDA projects?
I have already tried including and linking to my C++ project the headers and libraries that are automatically added to CUDA projects by Visual Studio.
Converting comments into an answer:
The system requirements page for libcu++ indicates that only NVIDIA's toolchains (so nvcc or the HPC SDK compilers) are supported.
More generally, libcu++ is intended to implement a subset of the C++ standard library in a transparent way that allows it to be used identically in host and device code, without any programmer intervention beyond including a header and respecting its namespace conventions. It stands to reason that implementing this magic requires the NVIDIA toolchain.
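For what it's worth, a minimal libcu++ usage sketch that builds with the supported toolchain (e.g. nvcc -std=c++14 example.cu):

#include <cuda/std/array>

__global__ void sum4(int *out) {
    // cuda::std::array works identically in host and device code
    cuda::std::array<int, 4> a{1, 2, 3, 4};
    int s = 0;
    for (int i = 0; i < 4; ++i) s += a[i];
    *out = s;
}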

What does the --abi-compile=yes option of CUDA ptxas do (which costs registers)?

NVIDIA CUDA's PTX optimizing assembler, ptxas, has the following option:
--abi-compile <yes|no> (-abi)
Enable/Disable the compiling of functions using ABI.
Default value: 'yes'.
What ABI is that? And what happens when you disable it? It seems to result in fewer registers being used, hmm...
(Question inspired by this GTC 2011 presentation about register spilling.)
An application binary interface (ABI) describes how functions are called, how code interfaces with libraries, and so on. Among other things it provides a stack of function calls, which enables, for instance, calling kernels from kernels and linking against libraries. All of these features cost some registers (for maintaining the stack frame). ABIs are what make modern software work, and programmers typically cannot opt out of them.
You can still turn the ABI off (and save some registers), but keep in mind that calling external functions such as printf() will no longer work.
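For example, to experiment with this from nvcc (the -Xptxas switch forwards options through to ptxas, and -v makes ptxas print per-kernel register usage so the two modes can be compared):

nvcc -Xptxas -abi=no -Xptxas -v -o app app.cu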
The official CUDA documentation on the ABI and on ptxas will answer all your questions.

Given a pointer to a __global__ function, can I retrieve its name?

Suppose I have a pointer to a __global__ function in CUDA. Is there a way to programmatically ask CUDART for a string containing its name?
I don't believe this is possible by any public API.
I have previously tried poking around in the driver itself, but that doesn't look too promising. The compiler-emitted code for a <<< >>> kernel invocation clearly registers the mangled function name with the runtime via __cudaRegisterFunction, but I couldn't see any obvious way to perform a lookup by name/value in the runtime library. The driver API equivalent, cuModuleGetFunction, leads to an equally opaque type from which it doesn't seem possible to extract the function name.
Edited to add:
The host compiler itself doesn't support reflection, so there are no obvious fancy language tricks that could be pulled at runtime. One possibility would be to add another preprocessor pass to the compilation trajectory to build a static kernel function lookup table before the final build. That would be rather a lot of work, but it could be done, at least for "classic" compilation where everything winds up in a single translation unit.
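A sketch of that idea, scaled down to a hand-maintained registry rather than an extra preprocessor pass (REGISTER_KERNEL and kernel_name are hypothetical helpers, not CUDA APIs):

#include <map>

__global__ void myKernel(int *p) { *p = 42; }

// host-side table mapping kernel pointers to their source names
static std::map<const void*, const char*> kernel_names;
#define REGISTER_KERNEL(k) (kernel_names[(const void*)&(k)] = #k)

const char *kernel_name(const void *fp) {
    auto it = kernel_names.find(fp);
    return it == kernel_names.end() ? "<unknown>" : it->second;
}

// during initialisation: REGISTER_KERNEL(myKernel);
// later: kernel_name((const void*)&myKernel) returns "myKernel"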

Standard Fortran interface for cuBLAS

I am using a commercial simulation software on Linux that does intensive matrix manipulation. The software uses Intel MKL by default, but it allows me to replace it with a custom BLAS/LAPACK library. This library must be a shared object (.so) library and must export both BLAS and LAPACK standard routines. The software requires the standard Fortran interface for all of them.
To verify that I can use a custom library, I compiled ATLAS and linked LAPACK (from netlib) inside it. The software was able to use my compiled ATLAS version without any problems.
Now, I want to make the software use cuBLAS in order to enhance the simulation speed. I ran into the problem that cuBLAS doesn't export the standard BLAS function names (they have a cublas prefix), and moreover the cuBLAS library doesn't include LAPACK routines.
I used readelf -a to check the exported functions.
On the other hand, I tried to use MAGMA to solve this problem. I succeeded in compiling and linking it against ATLAS, LAPACK and cuBLAS, but it still doesn't export the correct function names and doesn't include LAPACK in the final shared object. I am not sure whether this is how it is supposed to be or whether I did something wrong during the build process.
I have also found CULA, but I am not sure if this will solve the problem or not.
Has anybody tried to get cuBLAS/LAPACK (or a proper wrapper) linked into a single .so exporting the standard Fortran interface with the correct function names? I believe it is conceptually possible, but I don't know how to do it!
Updated
As indicated by @talonmies, CUDA provides a Fortran thunking wrapper interface:
http://docs.nvidia.com/cuda/cublas/index.html#appendix-b-cublas-fortran-bindings
You should be able to run your application with it. But you probably will not get any performance improvement, due to the memory allocation/copy issue described below.
Old
It may not be easy. cuBLAS and the other CUDA library interfaces assume that all the data is already stored in device memory; in your case, however, all the data is still in CPU RAM before each call.
You may have to write your own wrapper to deal with this, along these lines:
void dgemm_(...) {                        /* standard Fortran-callable symbol */
    copy_data_from_cpu_ram_to_gpu_mem();  /* host -> device */
    cublasDgemm(...);                     /* works on the device copies */
    copy_data_from_gpu_mem_to_cpu_ram();  /* device -> host */
}
On the other hand, you have probably noticed that every single BLAS call then requires two host/device data copies. This may introduce huge overhead and slow down the overall performance, unless most of your calls are BLAS 3 operations.
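Fleshed out a little, such a thunking dgemm might look like the sketch below. It assumes the cuBLAS v2 API, the common trailing-underscore Fortran name mangling, and the no-transpose case only, and it omits all error handling; a real wrapper would also create the handle once rather than per call.

#include <cublas_v2.h>
#include <cuda_runtime.h>

void dgemm_(const char *transa, const char *transb,
            const int *m, const int *n, const int *k,
            const double *alpha, const double *A, const int *lda,
            const double *B, const int *ldb,
            const double *beta, double *C, const int *ldc)
{
    cublasHandle_t h;
    cublasCreate(&h);

    /* device copies of the column-major operands */
    double *dA, *dB, *dC;
    cudaMalloc((void**)&dA, sizeof(double) * (size_t)(*lda) * (*k));
    cudaMalloc((void**)&dB, sizeof(double) * (size_t)(*ldb) * (*n));
    cudaMalloc((void**)&dC, sizeof(double) * (size_t)(*ldc) * (*n));

    /* host -> device: the per-call overhead discussed above */
    cublasSetMatrix(*m, *k, sizeof(double), A, *lda, dA, *lda);
    cublasSetMatrix(*k, *n, sizeof(double), B, *ldb, dB, *ldb);
    cublasSetMatrix(*m, *n, sizeof(double), C, *ldc, dC, *ldc);

    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, *m, *n, *k,
                alpha, dA, *lda, dB, *ldb, beta, dC, *ldc);

    /* device -> host: the second copy per call */
    cublasGetMatrix(*m, *n, sizeof(double), dC, *ldc, C, *ldc);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cublasDestroy(h);
}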

Using STL containers in GNU Assembler

Is it possible to "link" the STL to an assembly program, e.g. similar to linking the glibc to use functions like strlen, etc.? Specifically, I want to write an assembly function which takes as an argument a std::vector and will be part of a lib. If this is possible, is there any documentation on this?
Any use of C++ templates will require the compiler to generate instantiations of those templates. So you don't really "link" something like the STL into a program; the compiler generates object code based upon your use of templates in the library.
However, if you can write some C++ code that forces the templates to be instantiated for whatever types and other arguments you need to use, then write some C-linkage functions to wrap the uses of those template instantiations, then you should be able to call those from your assembly code.
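For example, wrappers along these lines (the names are illustrative, not a standard API) both force the instantiations and give the assembly code plain C symbols to call:

#include <vector>
#include <cstddef>

// C-linkage entry points over specific template instantiations
extern "C" void vec_push_back_int(std::vector<int> *v, int x) {
    v->push_back(x);   // instantiates std::vector<int>::push_back
}

extern "C" std::size_t vec_size_int(const std::vector<int> *v) {
    return v->size();
}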
I strongly believe you're doing it wrong. Using assembler is not going to speed up your handling of the data. If you must use existing assembly code, simply pass raw buffers.
std::vector is by definition (in the standard) compatible with raw buffers (arrays); the standard mandates contiguous allocation. Only reallocation can invalidate the memory region that contains the element data. In short, if the C++ code can know the (maximum) capacity required and reserve()/resize() appropriately, you can pass &vector[0] as the buffer address and be perfectly happy, as the sketch below shows.
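A minimal sketch of that approach, assuming a hypothetical C-linkage assembly routine asm_sum:

#include <vector>
#include <cstddef>

extern "C" long asm_sum(const int *buf, std::size_t n);

long sum(std::vector<int> &v) {
    // contiguous storage is guaranteed, so the element array can be
    // handed to the assembly routine directly
    return asm_sum(&v[0], v.size());   // v.data() since C++11
}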
If the assembly code needs to decide how much to (re)allocate, let it use malloc. Once done, you should be able to use that array with the STL algorithms:
std::accumulate(buf, buf+n, 0, &dosomething);
Alternatively, you can use the fact that std::tr1::array<T, n> or boost::array<T, n> are POD, and use placement new right on the buffer allocated in the library (see here: placement new + array +alignment or How to make tr1::array allocate aligned memory?)
Side note
I have the suspicion that you are using assembly for the wrong reasons: optimizing compilers already leverage the full potential of modern processors (including SIMD such as SSE1-4).
E.g. for gcc, have a look at:
__attribute__ (e.g. for pointer restrictions such as alignment and aliasing guarantees: this enables the more powerful vectorization options in the compiler);
-ftree-vectorize and -ftree-vectorizer-verbose=2, -march=native
Note also that since the compiler can't be sure what registers an external (or even inline) assembly procedure clobbers, it must assume all registers are clobbered leading to potential performance degradation. See http://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html for ways to use inline assembly with proper hints to gcc.
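A minimal illustration of such a hint: list the outputs, inputs and clobbers explicitly, so gcc knows exactly what the fragment touches instead of assuming the worst.

long x = 5;
// "+r" ties x to a register as both input and output; the "cc" clobber
// tells gcc only the condition codes are modified
asm("addq $1, %0" : "+r"(x) : : "cc");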
Probably completely off-topic: -fopenmp and the libstdc++ parallel mode (__gnu_parallel).
Bonus: the following references on (premature) optimization in assembly and C++ might come in handy:
Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms
Optimizing subroutines in assembly language: An optimization guide for x86 platforms
The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers