Before I dive too deep into CUDA programming I need to orient myself. The NVIDIA CUDA programming guides made a distinct change from referring to "CUDA C" to "CUDA C++" between versions 10.1 and 10.2. Since this was a minor version change, I suspect it is just semantics. I compared sample code from pre-10.1 and post-10.2 and found no difference...though that doesn't mean there is no difference. Was there a more subtle programming paradigm shift between these versions?
Here's my suspicion: CUDA has always been an extension of C++, not C, but everyone has referred to it as CUDA C because we don't take advantage of the OOP offered by C++ when writing CUDA code. Is that a fair assessment?
I think your assessment is reasonable conjecture. People are sometimes imprecise in their references to C and C++, and so CUDA probably hasn't been very rigorous here either. There is some history though that suggests to me this is not purely hand-waving.
CUDA started out largely as a C-style realization, but over time added C++-style features. Certainly by CUDA 4.0 (circa 2011), if not before, there were plenty of C++-style features.
Lately, CUDA drops the reference to C and instead claims compliance with a particular ISO C++ standard, subject to various enumerated restrictions and limitations.
The CUDA compiler, nvcc, behaves by default like a C++ style compiler (so, for example, using C++ style mangling), and will by default invoke the host-code C++ compiler (e.g. g++) not the host code C compiler (e.g. gcc) when passing off host code to be compiled.
As you point out, a programmer can use the C++ language syntactically in a very similar way to C usage (e.g. without the use of classes, to pick one example). This is also true for CUDA C++.
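To make the point about C-style usage concrete, here is a minimal host-only sketch in plain C++ (no CUDA toolkit needed). The names saxpy_c_style, axpy, and twice are hypothetical and exist only for this illustration. The first function is written the way a C programmer would write it, yet it is still compiled as C++; the other two use features (templates, overloading) that only a C++ compiler accepts.

```cpp
#include <cstddef>
#include <vector>

// C-style: raw pointers and an explicit length, no classes or templates.
// (saxpy_c_style is a hypothetical name used only for illustration.)
void saxpy_c_style(float a, const float* x, float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// C++-style: a template over the element type, taking standard containers.
// Only the elements present in both vectors are updated.
template <typename T>
void axpy(T a, const std::vector<T>& x, std::vector<T>& y) {
    for (std::size_t i = 0; i < x.size() && i < y.size(); ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Overloading is a C++-only feature. It works because a C++ compiler
// mangles the parameter types into each function's symbol name.
int twice(int v) { return 2 * v; }
double twice(double v) { return 2.0 * v; }
```

All three styles can live in the same translation unit, which is roughly the situation in a .cu file: C-looking code is accepted, but it is processed under C++ rules.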
It's not possible to build Rome in a day, and so CUDA development has proceeded in various areas at various rates. For example, one stated limitation of CUDA is that elements of the standard library (std::) are not necessarily supported in device code. However various CUDA developers are working to gradually fill in this gap with the libcu++ evolution.
Is there any compiled language that has garbage collection built in?
To my understanding right now, the purpose of an interpreter or JVM is to make binaries platform independent. Is it also because of the GC? Or is GC possible in compiled code?
SML, OCaml, Eiffel, D, Go, and Haskell are all statically-typed languages with garbage collection that are typically compiled ahead of time to native code.
As you correctly point out, virtual machines are mostly used to abstract away machine-dependent properties of underlying platforms. Garbage collection is an orthogonal technology. Usually it is not mandatory for a language, but is considered a desired property of a run-time environment. There are indeed languages with primitives to allocate memory (e.g., new in Java and C#) but without primitives to release it. They can be thought of as languages with built-in GC.
One such programming language is Eiffel. Most Eiffel compilers generate C code for portability reasons. This C code is then turned into machine code by a standard C compiler. Eiffel implementations provide GC (and sometimes even accurate GC) for this compiled code, and there is no need for a VM. In particular, the VisualEiffel compiler generated native x86 machine code directly, with full GC support.
Garbage collection is possible in compiled languages.
The Boehm GC is a well-known garbage collector for C and C++ (see its Wikipedia article).
Another example is the D programming language, which has garbage collection.
https://nim-lang.org
The Nim language is making progress and is quite portable, since it works by generating C, C++, JavaScript, or Objective-C code.
I have code written in old-style Fortran 95 for combustion modelling. One feature of this problem is that one has to solve a stiff ODE system to take into account the influence of chemical reactions. For this purpose I use the Fortran SLATEC library, which is also quite old. The solution procedure is straightforward: one just needs to call the subroutine ddriv3 in every cell of the computational domain, so it looks something like this:
do i = 1,Number_of_cells ! Number of cells is about 2000
call ddriv3(...) ! All calls are independent on cell number i
end do
ddriv3 is quite complex and utilizes many other library functions.
Is there any way to gain an advantage with CUDA Fortran without searching for another library for this purpose? If I just run this as a "parallel loop", will that be efficient, or is there another way?
I'm sorry for asking a question that immediately invites the most obvious answer, "Why don't you try it and find out for yourself?", but I'm under real time pressure. I have no experience with CUDA and I just want to choose the best and easiest way to start.
Thanks in advance!
You won't be able to use or parallelize the ddriv3 call without some effort. Your usage of the phrase "parallel loop" suggests to me you may be thinking of using OpenACC directives with Fortran, as opposed to CUDA Fortran, but the general answer isn't any different in either case.
The ddriv3 call, being part of a Fortran library (which is presumably compiled for x86 usage) cannot be directly used in either CUDA Fortran (i.e. using CUDA GPU kernels within Fortran) or in OpenACC Fortran, for essentially the same reason: The library code is x86 code and cannot be used on the GPU.
Since presumably you may have access to the source implementation of ddriv3, you might be able to extract the source code, and work on creating a CUDA version of it (or a version that OpenACC won't choke on), but if it uses many other library routines, it may mean that you have to create CUDA (or direct Fortran source, for OpenACC) versions of each of those library calls as well. If you have no experience with CUDA, this might not be what you want to do (I don't know.) If you go down this path, it would certainly imply learning more about CUDA, or at least converting the library calls to direct Fortran source (for an OpenACC version).
For the above reasons, it might make sense to investigate whether a GPU library replacement (or something similar) exists for the ddriv3 call (although you specifically excluded that option in your question). There are certainly GPU libraries that can assist in solving ODEs.
I want to start with CUDA in C++, and I am familiar with C++, Qt, and C#.
But I want to know whether it is better to use the CUDA libraries (at the high level) or the CUDA APIs (at the lower level).
Is it better to start from the API and not use the CUDA driver?
(I have started on "CUDA by Example" for its concepts in parallel.)
Since you are familiar with C/C++, you'd be better off using the higher-level API, CUDA C (or C for CUDA), which is more convenient and easier to write, because it consists of a minimal set of extensions to the C language plus a runtime library.
The lower-level API, which is the CUDA driver API that provides an additional level of control by exposing lower-level concepts, requires more code, is harder to program and debug, but offers a better level of control and is language-independent since it handles binary or assembly code.
See Chapter 3 of the CUDA programming guide for more details.
This is a bit of a silly question, but I'm wondering whether CUDA uses an interpreter or a compiler.
I'm wondering because I'm not quite sure how CUDA manages to get source code to run on two cards with different compute capabilities.
From Wikipedia:
Programmers use 'C for CUDA' (C with Nvidia extensions and certain restrictions), compiled through a PathScale Open64 C compiler.
So, your answer is: it uses a compiler.
And to touch on the reason it can run on multiple cards (source):
CUDA C/C++ provides an abstraction, it's a means for you to express how you want your program to execute. The compiler generates PTX code which is also not hardware specific. At runtime the PTX is compiled for a specific target GPU - this is the responsibility of the driver which is updated every time a new GPU is released.
These official documents CUDA C Programming Guide and The CUDA Compiler Driver (NVCC) explain all the details about the compilation process.
From the second document:
nvcc mimics the behavior of the GNU compiler gcc: it accepts a range
of conventional compiler options, such as for defining macros and
include/library paths, and for steering the compilation process.
This is not limited to CUDA: shaders in DirectX or OpenGL are also compiled to some kind of bytecode and converted to native code by the underlying driver.
My purpose in using LAPACK is to calculate the Cholesky decomposition of a matrix. I am programming in C/C++ on Fedora, but I am confused about which LAPACK to install: LAPACK with LAPACKE, or CLAPACK?
The basic difference between the two is the need for a Fortran compiler.
CLAPACK is basically just the reference NETLIB LAPACK routines passed through the old f2c converter, allowing the library to be compiled with a C compiler.
LAPACKE is an attempt (started by Intel IIRC) to define a formal C language interface for Fortran LAPACK libraries. It has the advantage that it is independent of the LAPACK implementation and hides toolchain-specific C-to-Fortran interoperability details, so that the programmer doesn't have to worry about them. LAPACKE also has the distinct advantage of working correctly with the C99 complex intrinsic type.
I would not expect a major performance difference between the two (the choice of BLAS dictates most of that), but I would probably favor LAPACKE plus the LAPACK and BLAS implementation of choice if I were to start from scratch today.
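For orientation, here is a minimal, self-contained C++ sketch of what a Cholesky factorization computes. It is a hand-written textbook loop, not LAPACK or LAPACKE code, and cholesky_lower is a hypothetical name chosen for this illustration. In a real program you would call the library's factorization routine instead (dpotrf in LAPACK naming, exposed by LAPACKE as LAPACKE_dpotrf), which will be considerably faster and more robust than this sketch.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Computes the lower-triangular factor L with A = L * L^T for a symmetric
// positive-definite n x n matrix stored row-major in `a`.
// Only the lower triangle of `a` is read. The result is written to `l`
// (row-major, upper triangle set to zero).
// Returns false if a non-positive pivot is met, i.e. A is not positive definite.
bool cholesky_lower(const std::vector<double>& a, std::size_t n,
                    std::vector<double>& l) {
    l.assign(n * n, 0.0);
    for (std::size_t j = 0; j < n; ++j) {
        // Diagonal entry: subtract the squares already placed in row j.
        double diag = a[j * n + j];
        for (std::size_t k = 0; k < j; ++k) {
            diag -= l[j * n + k] * l[j * n + k];
        }
        if (diag <= 0.0) {
            return false;
        }
        l[j * n + j] = std::sqrt(diag);

        // Entries below the diagonal in column j.
        for (std::size_t i = j + 1; i < n; ++i) {
            double s = a[i * n + j];
            for (std::size_t k = 0; k < j; ++k) {
                s -= l[i * n + k] * l[j * n + k];
            }
            l[i * n + j] = s / l[j * n + j];
        }
    }
    return true;
}
```

Calling cholesky_lower on the 3 x 3 matrix with rows (4, 12, -16), (12, 37, -43), (-16, -43, 98) yields the factor with rows (2, 0, 0), (6, 1, 0), (-8, 5, 3).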