Numba AOT output files: can we generate C++ or Cython-like files from an AOT-compiled function?

Question: Is there an option to make Numba AOT output intermediate files (C++, Cython, or similar) that I haven't found yet?
First off: Pythran is a Python-to-C++-to-PYD compiler. You can output the generated .cpp files with a simple -e option to the compiler. It also accepts OpenMP directives written as comments, such as #omp parallel for. Numba, which gets more attention, is mostly used in JIT mode, but it also has an option to AOT (ahead-of-time) compile modules; AOT-compiled modules, however, lose the parallel optimizations.
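For illustration, here is a minimal sketch of the kind of Pythran module I mean (the function, its export signature and the OpenMP comment are just examples):

#pythran export dot_scaled(float64[], float64[], float)
def dot_scaled(x, y, alpha):
    # Pythran picks up OpenMP directives written as comments, like the one below.
    s = 0.
    #omp parallel for reduction(+:s)
    for i in range(len(x)):
        s += x[i] * y[i]
    return alpha * s

Compiling with "pythran -e mymodule.py" emits the generated C++ instead of a binary extension, which is the kind of output I would like to get from Numba as well.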
I am using both of these libraries in a project. The Numba AOT-compiled function is much faster than the Pythran one, even though the Pythran version uses #omp parallel for loops and the Numba AOT version has no parallelism at all. So the Numba version is doing something my Pythran program is not.
So I want to see what it is doing. But when I look at the source code for Numba's AOT path (from numba.pycc import CC), it appears that Numba is somehow generating bytecode and then compiling that into a PYD. The documentation doesn't state anywhere that this is what happens, or whether it is even possible to keep the preliminary files Numba generates before compiling so that I can examine them. If they are bytecode, I can't read that anyway; but if they are in either Cython or C++ form, then it is easy to examine the optimizations being done.
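For context, this is the AOT workflow I am talking about, as a minimal sketch (the module name, exported function and signature are just placeholders):

from numba.pycc import CC

cc = CC('my_module')                     # name of the generated extension (.pyd on Windows)

@cc.export('square_sum', 'f8(f8[:])')    # exported name and Numba type signature
def square_sum(arr):
    total = 0.0
    for x in arr:
        total += x * x
    return total

if __name__ == '__main__':
    cc.compile()                         # builds my_module.pyd / .so next to this script

It is whatever cc.compile() produces internally, before the final extension is built, that I would like to dump and read.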
So: I'm looking for someone who is more familiar with Numba AOT and knows of any "output intermediate files" option I haven't found yet.
Even if the answer is "NO, YOU CAN'T DO THAT", I need to hear it; then I can focus on the parts of the code I can actually change.
Appreciated!

Related

Is there a way to specify __device__ for an entire file? (Nvidia Cuda Compiler)

I am importing a library and I get this error when compiling:
go.cu(61): error: calling a __host__ function("TinyJS::Interpreter::Interpreter()") from a __global__ function("capnduk_kernel") is not allowed
...is there a way to port an entire file (TinyJS) to run on the device?
I've checked the compiler documentation, and it doesn't look like there's a way to do this. I'm guessing the only way is to rewrite the file by hand, which is a can of worms.
There isn't a way to do this with nvcc. It will require manual effort.
While NVCC does not support this (as Robert points out), there is such an option for run-time compilation via the NVRTC library:
The documentation lists the following compilation option:
--device-as-default-execution-space (-default-device)
Treat entities with no execution space annotation as __device__ entities.
Notes:
With this being the case, I would consider submitting a bug report to NVIDIA and asking them to add this option to NVCC.
clang++ supports compiling CUDA, perhaps it has such a flag.
NVRTC is also supported by the Modern C++ wrappers library for CUDA, which is more convenient to use than working with NVRTC directly. (Caveat: that's my own library.)
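A minimal sketch of passing that option through the plain NVRTC C API (error handling omitted; the kernel source string is only a placeholder, and the program must be linked against -lnvrtc):

#include <nvrtc.h>
#include <stdio.h>

int main(void)
{
    /* Note: add() has no execution-space annotation. */
    const char *src = "int add(int a, int b) { return a + b; }\n"
                      "__global__ void kern(int *out) { *out = add(1, 2); }\n";

    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "kern.cu", 0, NULL, NULL);

    /* With this option, the un-annotated add() is treated as a __device__ entity. */
    const char *opts[] = { "--device-as-default-execution-space" };
    nvrtcResult res = nvrtcCompileProgram(prog, 1, opts);
    printf("compile result: %d\n", (int)res);

    nvrtcDestroyProgram(&prog);
    return 0;
}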

Several questions about CUDA and cuPrintf

1. I can successfully compile my code using cuPrintf with nvcc, but I cannot compile it in the Visual Studio 2012 environment. It says that "volatile char *" cannot be converted to "const void *" in the cudaMemcpyToSymbol function.
2. cuPrintf doesn't seem to work; no cuPrintf output is produced from the kernel code.
3. How can I make nvcc export a PDB file?
4. Is there any other convenient way to debug kernel functions? I have only one laptop.
First, cuPrintf is deprecated (as far as I know it was never officially released). You can print data from a kernel using the in-kernel printf (see the sketch below), but this is not a recommended way of debugging your kernels.
Second, you are compiling with the CUDA nvcc compiler; there is no such thing as a PDB file in CUDA. Do watch the -g and -G flags, though: they may dramatically increase your running time.
Third, the best way to debug kernels is with NVIDIA Nsight Visual Studio Edition.
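For the printf approach mentioned in the first point, a minimal sketch (requires a device of compute capability 2.0 or later; the kernel is just an example):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void debug_kernel(int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        printf("hello from thread %d\n", i);   // device-side printf
}

int main()
{
    debug_kernel<<<1, 4>>>(4);
    cudaDeviceSynchronize();   // synchronize so the buffered device output is flushed
    return 0;
}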

Standard Fortran interface for cuBLAS

I am using a commercial simulation software on Linux that does intensive matrix manipulation. The software uses Intel MKL by default, but it allows me to replace it with a custom BLAS/LAPACK library. This library must be a shared object (.so) library and must export both BLAS and LAPACK standard routines. The software requires the standard Fortran interface for all of them.
To verify that I can use a custom library, I compiled ATLAS and linked LAPACK (from netlib) inside it. The software was able to use my compiled ATLAS version without any problems.
Now, I want to make the software use cuBLAS in order to speed up the simulation. I ran into the problem that cuBLAS doesn't export the standard BLAS function names (they have a cublas prefix). Moreover, the cuBLAS library doesn't include LAPACK routines.
I used readelf -a to check the exported functions.
On the other hand, I tried to use MAGMA to solve this problem. I succeeded in compiling it and linking it against ATLAS, LAPACK and cuBLAS, but it still doesn't export the correct function names and doesn't include LAPACK in the final shared object. I am not sure whether this is how it is supposed to be or whether I did something wrong during the build process.
I have also found CULA, but I am not sure if this will solve the problem or not.
Has anybody tried to get cuBLAS/LAPACK (or a proper wrapper) linked into a single .so that exports the standard Fortran interface with the correct function names? I believe it is conceptually possible, but I don't know how to do it!
Updated
As indicated by @talonmies, CUDA provides a Fortran thunking wrapper interface:
http://docs.nvidia.com/cuda/cublas/index.html#appendix-b-cublas-fortran-bindings
You should be able to run your application with it, but you probably will not get any performance improvement because of the memory allocation/copy issue described below.
Old
It may not be easy. cuBLAS and the other CUDA library interfaces assume that all the data is already stored in device memory; in your case, however, the data is still in CPU RAM before each call.
You may have to write your own wrapper to deal with it, along the lines of:
void dgemm(...) {
    copy_data_from_cpu_ram_to_gpu_mem();
    cublas_dgemm(...);
    copy_data_from_gpu_mem_to_cpu_ram();
}
On the other hand, as you have probably noticed, every single BLAS call then requires two extra data copies (host to device and back). This may introduce huge overhead and slow down the overall performance, unless most of your calls are BLAS 3 operations.
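To make the idea concrete, here is a rough sketch of such a thunking wrapper for dgemm_ (the Fortran symbol the simulation software would look up). It is only a sketch: it assumes column-major data with no transposition, creates a handle per call, and omits all error handling; the Fortran bindings linked above are the more complete reference.

#include <cuda_runtime.h>
#include <cublas_v2.h>

/* Exported with the common Fortran naming convention: trailing underscore,
   all arguments passed by reference. Assumes transa = transb = 'N'. */
void dgemm_(const char *transa, const char *transb,
            const int *m, const int *n, const int *k,
            const double *alpha, const double *A, const int *lda,
            const double *B, const int *ldb,
            const double *beta, double *C, const int *ldc)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    double *dA, *dB, *dC;
    cudaMalloc((void **)&dA, sizeof(double) * (size_t)(*lda) * (*k));
    cudaMalloc((void **)&dB, sizeof(double) * (size_t)(*ldb) * (*n));
    cudaMalloc((void **)&dC, sizeof(double) * (size_t)(*ldc) * (*n));

    /* copy_data_from_cpu_ram_to_gpu_mem() */
    cublasSetMatrix(*m, *k, sizeof(double), A, *lda, dA, *lda);
    cublasSetMatrix(*k, *n, sizeof(double), B, *ldb, dB, *ldb);
    cublasSetMatrix(*m, *n, sizeof(double), C, *ldc, dC, *ldc);

    /* cublas_dgemm(...) */
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, *m, *n, *k,
                alpha, dA, *lda, dB, *ldb, beta, dC, *ldc);

    /* copy_data_from_gpu_mem_to_cpu_ram() */
    cublasGetMatrix(*m, *n, sizeof(double), dC, *ldc, C, *ldc);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    cublasDestroy(handle);
}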

How can I read the PTX?

I am working with compute capability 3.5, CUDA 5 and VS 2010 (and obviously Windows).
I am interested in reading the compiled code to better understand the implications of my C code changes.
What configuration do I need in VS to compile the code into a readable form (is setting the output to PTX enough)?
What tool do I need to reverse engineer the generated PTX so I can read it?
In general, to create a ptx version of a particular .cu file, the command is:
nvcc -ptx mycode.cu
which will generate a mycode.ptx file containing the ptx code corresponding to the file you used. It's probably instructive to use the -src-in-ptx option as well:
nvcc -ptx -src-in-ptx mycode.cu
which will intersperse the lines of source code with the lines of ptx they correspond to.
To comprehend ptx, start with the documentation.
Note that the compiler may generate ptx code that doesn't correspond to the source code very well, or is otherwise confusing, due to optimizations. You may wish (perhaps to gain insight) to compile some test cases using the -G switch as well, to see how the non-optimized version compares.
Since the Windows environment may vary from machine to machine, I think it's easier if you just look at the path your particular version of msvc++ uses to invoke nvcc (look at the console output from one of your projects when you compile it) and prepend the commands I give above with that path. I'm not sure there's much utility in trying to build this directly into Visual Studio, unless you have a specific need to compile from ptx to an executable. There are also a few sample codes that deal with ptx in some fashion.
Also note, for completeness, that ptx is not actually what's executed by the device (though it's generally pretty close). It is an intermediate code that can be re-targeted to devices within a family, either by nvcc or by a portion of the compiler that also lives in the GPU driver. To see the actual code executed by the device, we use the executable instead of the source code, and the tool to extract the machine assembly code is:
cuobjdump -sass mycode.exe
Similar caveats about prepending an appropriate path, if needed. I would start with the ptx. I think for what you want to do, it's enough.

Does CUDA use an interpreter or a compiler?

This is a bit of silly question, but I'm wondering if CUDA uses an interpreter or a compiler?
I'm wondering because I'm not quite sure how CUDA manages to get source code to run on two cards with different compute capabilities.
From Wikipedia:
Programmers use 'C for CUDA' (C with Nvidia extensions and certain restrictions), compiled through a PathScale Open64 C compiler.
So, your answer is: it uses a compiler.
And to touch on the reason it can run on multiple cards (source):
CUDA C/C++ provides an abstraction, it's a means for you to express how you want your program to execute. The compiler generates PTX code which is also not hardware specific. At runtime the PTX is compiled for a specific target GPU - this is the responsibility of the driver which is updated every time a new GPU is released.
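As a hypothetical illustration of that two-stage model, an nvcc invocation like the following embeds machine code for specific GPUs plus PTX that the driver can JIT-compile for newer ones (the architectures chosen here are examples only):
nvcc -gencode arch=compute_35,code=sm_35 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_52,code=compute_52 mycode.cu -o myapp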
These official documents CUDA C Programming Guide and The CUDA Compiler Driver (NVCC) explain all the details about the compilation process.
From the second document:
nvcc mimics the behavior of the GNU compiler gcc: it accepts a range of conventional compiler options, such as for defining macros and include/library paths, and for steering the compilation process.
This is not limited to CUDA: shaders in DirectX or OpenGL are also compiled to some kind of bytecode and converted to native code by the underlying driver.