To simplify the build process in a project, I'd like to compile multiple source files into device PTX code, and have all those modules in a single .fatbin file to be linked later.
Currently I can achieve this either by compiling each file individually to .ptx, or by compiling them all at once with --keep to retain the intermediate files, and then adding each .ptx to a fatbinary explicitly:
nvcc -c --keep mysource1.cu mysource2.cu ...
fatbinary --create="mysources.fatbin" --image3=kind=ptx,file=mysource1.ptx --image3=kind=ptx,file=mysource2.ptx ...
This is quite cumbersome though, so I was wondering if there is a simpler/more terse way of doing so, perhaps in a single nvcc invocation. I've tried calling nvcc --fatbin --device-link on multiple source files, but that does not seem to keep the ptx code in the output fatbinary (at least not when inspecting with cuobjdump).
One possible approach here would be to use a library. The command could look something like this:
nvcc -gencode arch=compute_XX,code=sm_XX -gencode ... --lib -rdc=true -o libmy.a mysource1.cu ...
The above command could be used in the case where you know device linking will eventually be necessary. In that case, you would specify the device-link step later, when you link objects or your final executable against the static library.
For the case where you know that device linking will not be necessary, just omit the -rdc=true switch.
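A minimal sketch of that later step, assuming a hypothetical main.cu that calls kernels from the library (nvcc performs the implicit device link as part of the final link):
nvcc -gencode arch=compute_XX,code=sm_XX -rdc=true main.cu libmy.a -o myapp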
I am still not sure how to properly specify the architectures for code generation when building with nvcc. I am aware that there is machine code as well as PTX code embedded in my binary and that this can be controlled via the compiler switches -code and -arch (or a combination of both using -gencode).
Now, according to this, apart from the two compiler flags there are also two ways of specifying architectures: sm_XX and compute_XX, where compute_XX refers to a virtual and sm_XX to a real architecture. The flag -arch only takes identifiers for virtual architectures (such as compute_XX), whereas the -code flag takes identifiers for both real and virtual architectures.
The documentation states that -arch specifies the virtual architectures for which the input files are compiled. However, this PTX code is not automatically compiled to machine code, but this is rather a "preprocessing step".
Now, -code is supposed to specify which architectures the PTX code is assembled and optimised for.
However, it is not clear which PTX or binary code will be embedded in the binary. If I specify for example -arch=compute_30 -code=sm_52, does that mean my code will first be compiled to feature level 3.0 PTX from which afterwards machine code for feature level 5.2 will be created? And what will be embedded?
If I just specify -code=sm_52 what will happen then? Only machine code for V5.2 will be embedded that has been created out of V5.2 PTX code? And what would be the difference to -code=compute_52?
Some related questions/answers are here and here.
I am still not sure how to properly specify the architectures for code generation when building with nvcc.
A complete description is somewhat complicated, but there are intended to be relatively simple, easy-to-remember canonical usages. Compile for the architecture (both virtual and real) that represents the GPUs you wish to target. A fairly simple form is:
-gencode arch=compute_XX,code=sm_XX
where XX is the two digit compute capability for the GPU you wish to target. If you wish to target multiple GPUs, simply repeat the entire sequence for each XX target. This is approximately the approach taken with the CUDA sample code projects. (If you'd like to include PTX in your executable, include an additional -gencode with the code option specifying the same PTX virtual architecture as the arch option).
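For example, a hypothetical build targeting cc 5.2 and cc 6.1 GPUs, plus cc 6.1 PTX for forward compatibility on newer hardware, could look like this (source and output names are placeholders):
nvcc mysource.cu -o myapp \
-gencode arch=compute_52,code=sm_52 \
-gencode arch=compute_61,code=sm_61 \
-gencode arch=compute_61,code=compute_61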
Another fairly simple form, when targeting only a single GPU, is just to use:
-arch=sm_XX
with the same description for XX. This form will include both SASS and PTX for the specified architecture.
Now, according to this, apart from the two compiler flags there are also two ways of specifying architectures: sm_XX and compute_XX, where compute_XX refers to a virtual and sm_XX to a real architecture. The flag -arch only takes identifiers for virtual architectures (such as compute_XX), whereas the -code flag takes identifiers for both real and virtual architectures.
That is basically correct when arch and code are used as sub-switches within the -gencode switch, or if both are used together, standalone as you describe. But, for example, when -arch is used by itself (without -code), it represents another kind of "shorthand" notation, and in that case, you can pass a real architecture, for example -arch=sm_52.
However, it is not clear which PTX or binary code will be embedded in the binary. If I specify for example -arch=compute_30 -code=sm_52, does that mean my code will first be compiled to feature level 3.0 PTX from which afterwards machine code for feature level 5.2 will be created? And what will be embedded?
The exact definition of what gets embedded varies depending on the form of the usage. But for this example:
-gencode arch=compute_30,code=sm_52
or for the equivalent case you identify:
-arch=compute_30 -code=sm_52
then yes, it means that:
A temporary PTX code will be generated from your source code, and it will use cc3.0 PTX.
From that PTX, the ptxas tool will generate cc5.2-compliant SASS code.
The SASS code will be embedded in your executable.
The PTX code will be discarded.
(I'm not sure why you would actually specify such a combo, but it is legal.)
If I just specify -code=sm_52 what will happen then? Only machine code for V5.2 will be embedded that has been created out of V5.2 PTX code? And what would be the difference to -code=compute_52?
-code=sm_52 will generate cc5.2 SASS code out of an intermediate PTX code. The SASS code will be embedded, the PTX will be discarded. Note that specifying this option by itself in this form, with no -arch option, would be illegal. (1)
-code=compute_52 will generate cc5.x PTX code (only) and embed that PTX in the executable/binary. Note that specifying this option by itself in this form, with no -arch option, would be illegal. (1)
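For reference, the legal -gencode equivalents of those two cases (assuming a hypothetical mycode.cu) would be:
nvcc -gencode arch=compute_52,code=sm_52 mycode.cu -o mycode
nvcc -gencode arch=compute_52,code=compute_52 mycode.cu -o mycode
The first embeds cc5.2 SASS only; the second embeds cc5.2 PTX only, which is then JIT-compiled by the driver at run time.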
The cuobjdump tool can be used to identify what components exactly are in a given binary.
(1) When no -gencode switch is used, and no -arch switch is used, nvcc assumes a default -arch=sm_20 is appended to your compile command (this is for CUDA 7.5, the default -arch setting may vary by CUDA version). sm_20 is a real architecture, and it is not legal to specify a real architecture on the -arch option when a -code option is also supplied.
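For example, assuming a hypothetical output file named myapp, the following lists the embedded SASS (cubin) and PTX images and their architectures:
cuobjdump --list-elf myapp
cuobjdump --list-ptx myapp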
I am trying to understand the nvcc compilation phases, but I am a little bit confused. Because I don't know the exact hardware configuration of the machine that will run my software, I want to use the JIT compilation feature in order to generate the best possible code for it. In the NVCC documentation I found this:
"For instance, the command below allows generation of exactly matching GPU binary code, when the application is launched on an sm_10, an sm_13, and even a later architecture:"
nvcc x.cu -arch=compute_10 -code=compute_10
So my understanding is that the above options will produce the best/fastest/optimum code for the current GPU. Is that correct? I also read that the default nvcc options are:
nvcc x.cu -arch=compute_10 -code=sm_10,compute_10
If the above is indeed correct, why can't I use any compute_20 features in my application?
When you specify a target architecture you are restricting yourself to the features available in that architecture. That's because the PTX code is a virtual assembly code, so you need to know the features available during PTX generation. The PTX will be JIT compiled to the GPU binary code (SASS) for whatever GPU you are running on, but it can't target newer architecture features.
I suggest that you pick a minimum architecture (for example, 1.3 if you want double precision or 2.0 if you want a Fermi-or-later feature) and then create PTX for that architecture AND newer base architectures. You can do this in one command (although it will take longer since it requires multiple passes through the code) and bundle everything into a single fat binary.
An example command line may be:
nvcc <general options> <filename.cu> \
-gencode arch=compute_13,code=compute_13 \
-gencode arch=compute_20,code=compute_20 \
-gencode arch=compute_30,code=compute_30 \
-gencode arch=compute_35,code=compute_35
That will create four PTX versions in the binary. You could also compile SASS for selected GPUs at the same time, which has the advantage of avoiding the JIT compile time for your users, but also grows your binary size.
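For instance, a hedged variant of the last -gencode line above that also embeds cc 3.5 SASS alongside the cc 3.5 PTX, so that cc 3.5 GPUs skip the JIT step entirely:
nvcc <general options> <filename.cu> \
-gencode arch=compute_35,code=sm_35 \
-gencode arch=compute_35,code=compute_35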
Check out the NVCC manual for more information on this.
I am working with compute capability 3.5, CUDA 5 and VS 2010 (and obviously Windows).
I am interested in reading the compiled code to better understand the implications of my C code changes.
What configuration do I need in VS to compile the code for readability (is setting the compilation to PTX enough?)?
What tool do I need to reverse engineer the generated PTX to be able to read it?
In general, to create a ptx version of a particular .cu file, the command is:
nvcc -ptx mycode.cu
which will generate a mycode.ptx file containing the ptx code corresponding to the file you used. It's probably instructive to use the -src-in-ptx option as well:
nvcc -ptx -src-in-ptx mycode.cu
which will intersperse the lines of source code with the lines of ptx they correspond to.
To comprehend ptx, start with the documentation.
Note that the compiler may generate ptx code that doesn't correspond to the source code very well, or is otherwise confusing, due to optimizations. You may wish (perhaps to gain insight) to compile some test cases using the -G switch as well, to see how the non-optimized version compares.
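If you want a small test case to experiment with, a trivial hypothetical mycode.cu such as the following works well, since its ptx stays short enough to read end to end both with and without -G:
__global__ void scale(float *x, float a, int n)
{
    // one thread per element; guard against reading past the end
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= a;
}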
Since the Windows environment may vary from machine to machine, I think it's easier to look at the path your particular version of msvc++ is using to invoke nvcc (look at the console output from one of your projects when you compile it) and prepend that path to the commands I give above. I'm not sure there's much utility in trying to build this directly into Visual Studio, unless you have a specific need to compile from ptx to an executable. There are also a few sample codes that deal with ptx in some fashion.
Also note for completeness that ptx is not actually what's executed by the device (but it is generally pretty close). It is an intermediate code that can be re-targeted to devices within a family by nvcc or by a portion of the compiler that also lives in the GPU driver. To see the actual code executed by the device, we use the executable instead of the source code, and the tool to extract the machine assembly code is:
cuobjdump -sass mycode.exe
Similar caveats about prepending an appropriate path, if needed. I would start with the ptx. I think for what you want to do, it's enough.
Alright, I have a really troubling CUDA 5.0 question about how to link things properly. I'd be really grateful for any assistance!
Using the separable compilation features of CUDA 5.0, I generated a static library (*.a). This nicely links with other *.cu files when run through nvcc; I have done this many times.
I'd now like to take a *.cpp file and link it against the host code in this static library using g++ or whatever, but not nvcc. If I attempt this, I get linker errors like
undefined reference to __cudaRegisterLinkedBinary
I'm using both -lcuda and -lcudart and, to my knowledge, have the libraries in the correct order (meaning -lmylib -lcuda -lcudart). I don't think it is an issue with that. Maybe I'm wrong, but I feel I'm missing a step and that I need to do something else to my static library (device linking?) before I can use it with g++.
Have I missed something crucial? Is this even possible?
Bonus question: I want the end result to be a dynamic library. How can I achieve this?
When you link with nvcc, it does an implicit device link along with the host link. If you use the host compiler to link (like with g++), then you need to add an explicit step to do a device link with the -dlink option, e.g.
nvcc -arch=sm_35 -dc a.cu b.cu
nvcc -arch=sm_35 -dlink a.o b.o -o dlink.o
g++ a.o b.o dlink.o x.cpp -lcudart
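For context, a minimal sketch of what a.cu and b.cu might contain that actually requires this flow, i.e. a kernel in one file calling a __device__ function defined in the other (the contents are hypothetical; only the file names come from the commands above):
// b.cu: definition of a __device__ function used from another file
__device__ float scale(float x) { return 2.0f * x; }
// a.cu: kernel that calls the function defined in b.cu
extern __device__ float scale(float x);
__global__ void k(float *v) { v[threadIdx.x] = scale(v[threadIdx.x]); }
Without -dc (and the subsequent -dlink), the call from a.cu into b.cu could not be resolved, because the device code would not be relocatable across translation units.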
There is an example of exactly this in the Using Separate Compilation chapter of the nvcc doc.
Currently we only support static libraries for relocatable device code. We’d be interested in learning how you would want to use such code in a dynamic library. Please feel free to answer in the comments.
Edit:
To answer the question in the comment below, "Is there any way to use nvcc to turn mylib.a into something that can be put into g++?":
Just use the library like an object, like this:
nvcc -arch=sm_35 -dlink mylib.a -o dlink.o
g++ mylib.a dlink.o x.cpp -lcudart
I have a CUDA project. It consists of several .cpp files that contain my application logic and one .cu file that contains multiple kernels plus a __host__ function that invokes them.
Now I would like to determine the number of registers used by my kernel(s). My normal compiler call looks like this:
nvcc -arch compute_20 -link src/kernel.cu obj/..obj obj/..obj .. -o bin/..exe -l glew32 ...
Adding the "-Xptxas -v" compiler flag to this call unfortunately has no effect. The compiler still produces the same textual output as before. The compiled .exe also works the same way as before with one exception: My framerate jumps to 1800fps, up from 80fps.
I had the same problem; here is my solution:
Compile the *.cu files into device-only *.ptx files; this discards the host code:
nvcc -ptx *.cu
Compile the *.ptx file:
ptxas -v *.ptx
The second step will show you the number of registers and the amount of shared memory used by each kernel.
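If you also want to pin both steps to a specific architecture in this two-step flow, both tools accept one; for example (file names are placeholders):
nvcc -ptx -arch=compute_35 kernel.cu -o kernel.ptx
ptxas -v -arch=sm_35 kernel.ptx -o kernel.cubin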
Convert the compute_20 to sm_20 in your compiler call. That should fix it.
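Concretely, the compile line from the question would then become (object paths elided as in the original):
nvcc -arch sm_20 -Xptxas -v -link src/kernel.cu obj/..obj obj/..obj .. -o bin/..exe -l glew32 ...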
When "-Xptxas -v" and "-arch" are used together, we cannot get the verbose information (register count, etc.). If we want to see the verbose output without losing the ability to specify the GPU architecture (-arch, -code) up front, we can do the following: nvcc -arch compute_XX *.cu -keep, then ptxas -v *.ptx. But we will be left with many intermediate files. Certainly, kogut's answer is to the point.
When you compile, use:
nvcc --ptxas-options=-v
You may also need to check your build output verbosity defaults so the ptxas output is actually shown.
For example, in Visual Studio go to:
Tools->Options->ProjectsAndSolutions->BuildAndRun
and set the verbosity output to Normal.
Not exactly what you were looking for, but you can use the CUDA Visual Profiler shipped with the NVIDIA GPU Computing SDK. Besides much other useful information, it shows the number of registers used by each kernel in your application.