CUDA: compile multiple .cu files to one file

I am porting some calculations from C# to CUDA.
There are many C# classes I want to port; for each C# class I create a .cu and a .cuh file in my CUDA project.
All the classes are related, and all of them are used in the calculations.
I need to preserve the structure of my C# code, because otherwise it would be very easy to make mistakes.
P.S. If I put all the code in one file, everything works as expected, but reading the code or fixing issues becomes a real pain.
I want to compile the CUDA project and use it from C# via the ManagedCuda library.
I can compile a test CUDA project with one .cu file to a .ptx file, load it in C# via ManagedCuda, and call a function from it.
But when I compile my real project with multiple .cu files, I get a separate .ptx file for each .cu file in the project. Moreover, I am not able to load these .ptx files via ManagedCuda; I get the following error:
ErrorInvalidPtx: This indicates that a PTX JIT compilation failed.
This error is expected, because the PTX files cross-reference each other, and they only make sense if they are loaded together.
My goal is to compile my CUDA project to one file, but at the same time I do not want to be limited to the specific video card I have. For this I need to use PTX (or a cubin with PTX included), so that the PTX is compiled for the specific device at the moment it is loaded.
I tried setting Generate Relocatable Device Code to Yes (-rdc=true) and compiling to PTX and cubin; the result is the same: I get several independent files, one for each .cu file.

The very short answer is no, you can't do that. The toolchain cannot merge PTX code at the compilation phase.
If you produce multiple PTX files, you will need to use the JIT linker facilities of the CUDA driver API to produce a module which can be loaded into your context. I have no idea whether ManagedCuda supports that or not.
Edit: it appears that ManagedCuda does support runtime linking (see here).
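For reference, this is roughly what that runtime linking looks like at the CUDA driver API level, which is the layer ManagedCuda wraps. A minimal sketch, assuming two hypothetical per-file outputs a.ptx and b.ptx, with error checking omitted and a context already created:

// JIT-link several PTX files into one module at load time (driver API)
#include <cuda.h>

CUmodule linkAndLoad(void)
{
    CUlinkState state;
    CUmodule module;
    void *cubin;        // linked image, owned by the link state
    size_t cubinSize;

    cuLinkCreate(0, NULL, NULL, &state);
    cuLinkAddFile(state, CU_JIT_INPUT_PTX, "a.ptx", 0, NULL, NULL);
    cuLinkAddFile(state, CU_JIT_INPUT_PTX, "b.ptx", 0, NULL, NULL);
    cuLinkComplete(state, &cubin, &cubinSize);  // resolves the cross references
    cuModuleLoadData(&module, cubin);           // load everything as one module
    cuLinkDestroy(state);                       // also frees the linked image
    return module;
}

Note that PTX intended for linking must be built with relocatable device code (-rdc=true), which the question has already enabled.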

Related

Can the PGI compilers output the generated CUDA code to a file

I would like the generated CUDA code to be saved in a file for examination. Is this possible with OpenACC and the PGI compilers?
You should be able to pass -ta=nvidia,keepgpu,keepptx to any of the PGI GPU compilers, which will retain the intermediate code emitted by the toolchain during the build.
Also refer to the command line help, e.g.:
pgcc -help
Note that the PGI compilers have moved to a more integrated toolchain recently, which eliminates the generation of CUDA C intermediate source files, so the above approach works but gives you intermediate files that are not C code (they are LLVM IR and PTX). If you want CUDA C intermediate code, you can also add the nollvm option:
-ta=nvidia,keepgpu,keepptx,nollvm
The "kept" files will generally have .gpu and .h extensions for llvm/CUDA C code, and .ptx extension for PTX.

How can I read the PTX?

I am working with compute capability 3.5, CUDA 5 and VS 2010 (and obviously Windows).
I am interested in reading the compiled code to better understand the implications of my C code changes.
What configuration do I need in VS to compile the code for readability (is setting the compilation to PTX enough?)?
What tool do I need to reverse engineer the generated PTX to be able to read it?
In general, to create a ptx version of a particular .cu file, the command is:
nvcc -ptx mycode.cu
which will generate a mycode.ptx file containing the ptx code corresponding to the file you used. It's probably instructive to use the -src-in-ptx option as well:
nvcc -ptx -src-in-ptx mycode.cu
which will intersperse the lines of source code with the lines of PTX they correspond to.
To comprehend PTX, start with the documentation.
Note that the compiler may generate ptx code that doesn't correspond to the source code very well, or is otherwise confusing, due to optimizations. You may wish (perhaps to gain insight) to compile some test cases using the -G switch as well, to see how the non-optimized version compares.
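As a concrete starting point, a toy mycode.cu such as the following (entirely hypothetical) is small enough that its PTX is easy to follow; compile it once as shown above and once with -G added to compare:

// mycode.cu - trivial kernel for inspecting the generated ptx
__global__ void add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

(With newer toolkits, -src-in-ptx may also require -G or -lineinfo to have any effect.)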
Since the windows environment may vary from machine to machine, I think it's easier if you just look at the path your particular version of msvc++ is using to invoke nvcc (look at the console output from one of your projects when you compile it) and prepend the commands I give above with that path. I'm not sure there's much utility in trying to build this directly into Visual Studio, unless you have a specific need to compile from ptx to an executable. There are also a few sample codes that have to do with ptx in some fashion.
Also note for completeness that ptx is not actually what's executed by the device (but it is generally pretty close). It is an intermediate code that can be re-targeted to devices within a family by nvcc or by a portion of the compiler that also lives in the GPU driver. To see the actual code executed by the device, we use the executable instead of the source code, and the tool to extract the machine assembly code is:
cuobjdump -sass mycode.exe
Similar caveats about prepending an appropriate path, if needed. I would start with the ptx. I think for what you want to do, it's enough.

In CUDA's PTX file, what is the purpose of the ".file" directive?

From what I understand, CUDA's PTX file is virtual bytecode that is JIT compiled by the device runtime. This means the file is cross-platform: you can generate the PTX file and it will run on any CUDA-compatible device. However, when I read the file in a text editor, I see these ".file" directives which contain information about files on the original computer I compiled it on. So I am unsure what the purpose of these directives is. Also, given that my generated PTX files shouldn't be dependent on these files, can they safely be removed? (Like if I wanted to start writing my own PTX generator.)
If you inspect the body of your functions, you will most likely find instructions of the form
.loc fileNum fileLine fileColumn
This indicates that the following code was generated from line fileLine of file fileNum. fileNum is an integer index naming a file previously declared by the .file directive you are asking about.
This can help you correlate your source code with the produced PTX output.
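For illustration, the relevant pieces of a PTX file fit together roughly like this (path and line numbers made up):

.file 1 "C:/work/mycode.cu"   // declares file index 1
// ...
.loc 1 42 0                   // the instructions below came from line 42 of file 1
add.s32 %r3, %r1, %r2;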
During JIT compilation, the PTX is converted into native GPU machine code, where those .loc and .file directives do not appear at all. They have absolutely no impact on the final machine code, so yes, they can safely be removed.

How to extract PTX from a CUDA exe, and some related CUDA compiler questions

1) I want to extract the PTX code from a CUDA exe and use that kernel code in another program.
Is there a way to identify the kernel PTX code in an exe? I know the kernels are laid out arbitrarily in the exe file's data section.
I learnt that in Mac executables the PTX kernels start with .version and end with a null string. Is there something like that for Windows exe (PE) files? I guess I need to parse the exe file, gather PTX statements one at a time, and group them together as kernels, but I don't know how I would go about it; some help would get me started. I also find a .nvFatBi section in CUDA exes. What is that supposed to be?
2) I also learnt that there are global constructors which register the cubin with the CUDA runtime. I don't understand this part completely. Does the function cudaRegisterFatBinary come into play here? If so, how can I use this PTX to supply the pointer to cudaRegisterFatBinary? I understand I have to compile the PTX to a cubin file; is that possible programmatically? In short, I want to emulate nvcc itself in some sense.
Try: cuobjdump --dump-ptx [executable-name]
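For example (executable name hypothetical), you can first list the embedded PTX and then redirect the dump to a file:
cuobjdump --list-ptx myapp.exe
cuobjdump --dump-ptx myapp.exe > kernels.ptx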

CUDA with MinGW - updated

We have been developing our code on Linux, but would like to compile a Windows executable. The old non-GPU version compiles just fine with MinGW on Windows, so I was hoping I'd be able to do the same with the CUDA version.
The strategy is to compile the kernel code with nvcc in Visual Studio, and the rest with gcc in MinGW.
So far, we easily compiled the .cu file (with the kernel and kernel launches) in Visual Studio. However, we still can't compile the C code in MinGW. The C code contains CUDA API calls such as cudaMalloc and CUDA types such as cudaEvent_t, so we must include cuda.h and cuda_runtime.h. However, gcc gives warnings and errors for these headers, for example:
../include/host_defines.h:57:0: warning: "__cdecl" redefined
and
../include/vector_functions.h:127:14: error: 'short_4' has no member named 'x'
Any ideas on how we can include these headers and compile the C portion of the code?
If you are really desperate there might be a way. nvcc is really just a frontend for a bunch of compilers. It invokes g++ a lot to strip comments, separate device and host code, handle name mangling, link stuff back together, etc. (use --verbose to get the details).
My idea is as follows: you should be able to compile the host code with MinGW while compiling the device code to a fatbin on a Linux machine (as I guess the device binary is host-machine independent). Afterwards, link both parts of the code back together with MinGW, or use the driver API to load the fatbin dynamically, as in the sketch below. Disclaimer: did not test!
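The load side of that idea is small. A sketch using the driver API (file and kernel names hypothetical, error checks omitted):

// load a pre-built fatbin at runtime; this host code needs only cuda.h
// and the driver library, so it can be compiled with mingw
#include <cuda.h>

int main(void)
{
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction kernel;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuModuleLoad(&mod, "kernels.fatbin");           // also accepts .ptx / .cubin
    cuModuleGetFunction(&kernel, mod, "myKernel");  // use extern "C" in the .cu to avoid name mangling
    // ... set up arguments and launch with cuLaunchKernel ...
    return 0;
}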
As far as I know, it is impossible to use CUDA without MSVC. So you need MSVC to make nvcc work, and you can compile the CPU code with MinGW and link everything together.
According to http://forums.nvidia.com/index.php?showtopic=30743
"There are no current plans to support mingw."
You might want to take a look at how the cycles renderer handles this, look at https://developer.blender.org/diffusion/B/browse/master/extern/cuew/ and
https://developer.blender.org/diffusion/B/browse/master/intern/cycles/device/device_cuda.cpp
I know it's not an automagic trick but it might help you get started.