What does --entry take in CUDA's PTX JIT compiler?

NVIDIA's CUDA offers a PTX compilation library. One of the supported JIT compilation options for PTX code using the library is
--entry entry,... (-e)
which the documentation describes as:
Specify the entry functions for which code must be generated.
Entry function names for this option must be specified in the mangled name.
How do you "specify in the mangled name"? Is this telling us we need to specify mangled names, or does it mean something else?

This sentence:
must be specified in the mangled name.
should have read "must be specified in the mangled form (in which they appear in the PTX source)".
I'm planning on exposing this functionality in my API wrappers and will make sure to properly reflect this in method/member name(s) and Doxygen comments.
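For concreteness, here is a hedged illustration (the kernel itself is made up): a C++ kernel's name appears in the PTX under its mangled form, and that mangled form is what --entry expects.

__global__ void scale(float* data, int n)   // hypothetical kernel, used only to show the mangling
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

In the PTX generated for this file the entry point shows up as something like .visible .entry _Z5scalePfi(...), so the option would be --entry _Z5scalePfi. A kernel declared extern "C" keeps its plain name in the PTX, in which case that plain name is what you pass.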

Related

CUDA constant memory symbols

I am using CUDA 5.0 and I have modules which are compiled separately.
I would like to access the same value in the constant memory from all modules.
The problem is the following: when I define the symbol in each
module, the linker claims that the symbol has been redefined.
Is there a workaround or a solution for this problem?
Thank you for helping.
In CUDA separate compilation mode, there is a true linker, and every symbol which is linked into the final device binary payload must be uniquely defined. This means __constant__ memory symbols must be defined in only one place in all the code which is linked together.
The solution is probably to declare the symbol as extern in every translation unit except the one which contains the definition of the symbol. Note that this is the only case where it is valid to use extern with __constant__ symbols; otherwise they are implicitly static. There is a general discussion of the separate compilation model which describes this scenario buried in the documentation (both the programming guide and the nvcc manual, IIRC).
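A minimal sketch of that arrangement (the file and symbol names are invented), assuming everything is built with relocatable device code:

// common.cuh -- declaration included by every translation unit
extern __constant__ float scale_factor;

// defs.cu -- the one translation unit that actually defines the symbol
#include "common.cuh"
__constant__ float scale_factor;

// use.cu -- any other translation unit only sees the extern declaration
#include "common.cuh"
__global__ void apply(float* v) { v[threadIdx.x] *= scale_factor; }

Something along the lines of nvcc -rdc=true defs.cu use.cu -o app then links the object files with the device linker, and the symbol is defined exactly once.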

How can I read the PTX?

I am working with Capabilities 3.5, CUDA 5 and VS 2010 (and obviously Windows).
I am interested in reading the compiled code to understand better the implication of my C code changes.
What configuration do I need in VS to compile the code for readability (is setting the compilation to PTX enough?)?
What tool do I need to reverse engineer the generated PTX to be able to read it?
In general, to create a ptx version of a particular .cu file, the command is:
nvcc -ptx mycode.cu
which will generate a mycode.ptx file containing the ptx code corresponding to the file you used. It's probably instructive to use the -src-in-ptx option as well:
nvcc -ptx -src-in-ptx mycode.cu
which will intersperse the lines of source code with the lines of ptx they correspond to.
To comprehend ptx, start with the documentation.
Note that the compiler may generate ptx code that doesn't correspond to the source code very well, or is otherwise confusing, due to optimizations. You may wish (perhaps to gain insight) to compile some test cases using the -G switch as well, to see how the non-optimized version compares.
Since the windows environment may vary from machine to machine, I think it's easier if you just look at the path your particular version of msvc++ is using to invoke nvcc (look at the console output from one of your projects when you compile it) and prepend the commands I give above with that path. I'm not sure there's much utility in trying to build this directly into Visual Studio, unless you have a specific need to compile from ptx to an executable. There are also a few sample codes that have to do with ptx in some fashion.
Also note for completeness that ptx is not actually what's executed by the device (but generally pretty close). It is an intermediate code that can be re-targeted to devices within a family by nvcc or a portion of the compiler that also lives in the GPU driver. To see the actual code executed by the device, we use the executable instead of the source code, and the tool to extract the machine assembly code is:
cuobjdump -sass mycode.exe
Similar caveats apply about prepending an appropriate path, if needed. I would start with the ptx. I think for what you want to do, it's enough.
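If it helps to have a self-contained starting point, a file as trivial as this (the name and contents are arbitrary) is enough to experiment with the commands above:

// mycode.cu
__global__ void add1(int* v) { v[threadIdx.x] += 1; }

nvcc -ptx mycode.cu then produces mycode.ptx containing a .visible .entry _Z4add1Pi(...) section, and once you build an executable that launches the kernel, cuobjdump -sass on that executable shows the corresponding machine code.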

link cuda with gmp

I am trying to use cuda with the GNU multiple precision library (gmp). When I add gmp functions like mpf_init() to my device code I get this compiler error: tlgmp.cu(37): error: calling a host function("__gmpf_init") from a __device__/__global__ function("histo") is not allowed.
Is it possible to redefine gmp functions so that they can be used in device code?
The GMP library is compiled for the host, and so it can't be used directly in device code. That is the direct reason for the error you are seeing.
Since it's an open-source library, it might be possible with some effort to go through the code and create your own version that has the appropriate __device__ decorators (and possibly other changes) to the various functions you need. This would probably require a substantial amount of work, however.
Other alternatives might be to investigate the CUMP, xmp, or campary libraries.
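To make the "decorators" point concrete, here is a sketch of the general pattern (this is not GMP code; the function names are invented): only functions compiled with __device__, or __host__ __device__, can be called from a kernel.

// Host-only: calling this from a kernel triggers the same error as with __gmpf_init.
int host_only_add(int a, int b) { return a + b; }

// Compiled for both host and device, so it is callable from a kernel.
__host__ __device__ int both_add(int a, int b) { return a + b; }

__global__ void histo(int* out)
{
    out[threadIdx.x] = both_add(threadIdx.x, 1);         // fine
    // out[threadIdx.x] = host_only_add(threadIdx.x, 1); // would not compile
}

A device-capable GMP port would need that treatment applied through every function in the call chain, plus device-side replacements for anything that allocates memory or touches global state.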

Using STL containers in GNU Assembler

Is it possible to "link" the STL to an assembly program, e.g. similar to linking against glibc to use functions like strlen, etc.? Specifically, I want to write an assembly function which takes as an argument a std::vector and will be part of a lib. If this is possible, is there any documentation on this?
Any use of C++ templates will require the compiler to generate instantiations of those templates. So you don't really "link" something like the STL into a program; the compiler generates object code based upon your use of templates in the library.
However, if you can write some C++ code that forces the templates to be instantiated for whatever types and other arguments you need to use, then write some C-linkage functions to wrap the uses of those template instantiations, then you should be able to call those from your assembly code.
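A hedged sketch of that wrapper idea (the names are invented): the template code is instantiated on the C++ side, and only a plain, unmangled C symbol is exposed to the assembly module.

#include <numeric>
#include <vector>

// C linkage: the symbol is simply sum_vector, so assembly code can call it
// without knowing anything about C++ name mangling or vector's layout.
extern "C" long sum_vector(const std::vector<long>* v)
{
    // std::vector<long> and std::accumulate are instantiated here, in C++.
    return std::accumulate(v->begin(), v->end(), 0L);
}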
I strongly believe you're doing it wrong. Using assembler is not going to speed up your handling of the data. If you must use existing assembly code, simply pass raw buffers.
std::vector is by definition (in the standard) compatible with raw buffers (arrays); the standard mandates contiguous allocation. Only reallocation can invalidate the memory region that contains the element data. In short, if the C++ code can know the (max) capacity required and reserve()/resize() appropriately, you can pass &vector[0] as the buffer address and be perfectly happy.
If the assembly code needs to decide how (much) to reallocate, let it use malloc. Once done, you should be able to use that array with STL algorithms:
std::accumulate(buf, buf+n, 0, &dosomething);
Alternatively, you can use the fact that std::tr1::array<T, n> or boost::array<T, n> are POD, and use placement new right on the buffer allocated in the library (see here: placement new + array + alignment or How to make tr1::array allocate aligned memory?)
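Going the other way, when an existing assembly routine only needs the element data, a sketch like this (asm_process is a made-up name for a routine implemented in assembly) shows the raw-buffer hand-off:

#include <cstddef>
#include <vector>

extern "C" void asm_process(double* buf, std::size_t n);  // defined in the .s file

void run(std::vector<double>& v)
{
    v.resize(1024);                    // size fixed up front, so no reallocation
    asm_process(v.data(), v.size());   // &v[0] works just as well pre-C++11
}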
Side note
I have the suspicion that you are using assembly for the wrong reasons. Optimizing compilers will leverage the full potential of modern processors (including SIMD such as SSE1-4);
E.g. for gcc have a look at
__attribute__ (e.g. for pointer restrictions
such as alignment and aliasing guarantees: this will enable the more powerful vectorization options for the compiler; a brief sketch follows after this list);
-ftree-vectorize and -ftree-vectorizer-verbose=2, -march=native
Note also that since the compiler can't be sure what registers an external (or even inline) assembly procedure clobbers, it must assume all registers are clobbered leading to potential performance degradation. See http://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html for ways to use inline assembly with proper hints to gcc.
probably completely off-topic: -fopenmp and gnu::parallel
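As a rough sketch of the aliasing/alignment point mentioned above (the function and file names are illustrative), gcc will happily auto-vectorize a loop like this once it can rule out overlap and knows the alignment:

// scale.cpp
void scale(float* __restrict__ dst, const float* __restrict__ src, int n)
{
    dst = static_cast<float*>(__builtin_assume_aligned(dst, 32));
    src = static_cast<const float*>(__builtin_assume_aligned(src, 32));
    for (int i = 0; i < n; ++i)
        dst[i] = 2.0f * src[i];
}

// Compiled with something like: g++ -O3 -march=native -ftree-vectorizer-verbose=2 -c scale.cpp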
Bonus: the following references on (premature) optimization in assembly and c++ might come in handy:
Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms
Optimizing subroutines in assembly language: An optimization guide for x86 platforms
The microarchitecture of Intel, AMD and VIA CPUs: An optimization guide for assembly programmers and compiler makers
And some other relevant resources

Difference between cuda.h, cuda_runtime.h, cuda_runtime_api.h

I'm starting to program with CUDA, and in some examples I find the include files cuda.h, cuda_runtime.h and cuda_runtime_api.h included in the code. Can someone explain to me the difference between these files?
In very broad terms:
cuda.h defines the public host functions and types for the CUDA driver API.
cuda_runtime_api.h defines the public host functions and types for the CUDA runtime API.
cuda_runtime.h defines everything cuda_runtime_api.h does, as well as built-in type definitions and function overlays for the CUDA language extensions and device intrinsic functions.
If you were writing host code to be compiled with the host compiler which includes API calls, you would include either cuda.h or cuda_runtime_api.h. If you needed other CUDA language built-ins, like types, and were using the runtime API and compiling with the host compiler, you would include cuda_runtime.h. If you are writing code which will be compiled using nvcc, it is all irrelevant, because nvcc takes care of inclusion of all the required headers automatically without programmer intervention.
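As a small, hedged illustration of the host-compiler case (the file name and allocation size are arbitrary), a plain .cpp file that only makes runtime API calls needs nothing beyond cuda_runtime_api.h and libcudart:

// host_only.cpp -- no kernels, no CUDA language extensions
#include <cuda_runtime_api.h>
#include <cstdio>

int main()
{
    void* p = nullptr;
    cudaError_t err = cudaMalloc(&p, 1 << 20);           // declared in cuda_runtime_api.h
    std::printf("cudaMalloc: %s\n", cudaGetErrorString(err));
    cudaFree(p);
    return 0;
}

// Built with the host compiler alone, along the lines of:
// g++ host_only.cpp -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcudart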
A few observations in addition to @talonmies' answer:
cuda_runtime.h includes cuda_runtime_api.h internally, but not the other way around. So: "runtime includes all of runtime_api" is a mnemonic to remember.
cuda_runtime_api.h does not contain all of the runtime API functions you'll find in the official documentation, while cuda_runtime.h has them all (example: cudaEventCreate()). However, all API calls defined in cuda_runtime.h are actually implemented, in the header file itself, using calls to functions in cuda_runtime_api.h. These are the "function overlays" that @talonmies mentioned.
cuda_runtime_api.h is a C-language header (IIANM) with only C-language function declarations; cuda_runtime.h is a C++ header file, with some templated functions implemented.