NVCC compilation options for generating the best code (using JIT)

I am trying to understand the nvcc compilation phases but I am a little bit confused. Because I don't know the exact hardware configuration of the machine that will run my software, I want to use the JIT compilation feature in order to generate the best possible code for it. In the NVCC documentation I found this:
"For instance, the command below allows generation of exactly matching GPU binary code, when the application is launched on an sm_10, an sm_13, and even a later architecture:"
nvcc x.cu -arch=compute_10 -code=compute_10
So my understanding is that the above options will produce the best/fastest/optimum code for the current GPU. Is that correct? I also read that the default nvcc options are:
nvcc x.cu -arch=compute_10 -code=sm_10,compute_10
If the above is indeed correct, why can't I use any compute_20 features in my application?

When you specify a target architecture you are restricting yourself to the features available in that architecture. That's because PTX is a virtual assembly language, so the available features have to be known when the PTX is generated. The PTX will be JIT compiled to GPU binary code (SASS) for whatever GPU you are running on, but it cannot target features of an architecture newer than the one the PTX was generated for.
I suggest that you pick a minimum architecture (for example, 1.3 if you want double precision or 2.0 if you want a Fermi-or-later feature) and then create PTX for that architecture AND newer base architectures. You can do this in one command (although it will take longer since it requires multiple passes through the code) and bundle everything into a single fat binary.
An example command line may be:
nvcc <general options> <filename.cu> \
-gencode arch=compute_13,code=compute_13 \
-gencode arch=compute_20,code=compute_20 \
-gencode arch=compute_30,code=compute_30 \
-gencode arch=compute_35,code=compute_35
That will embed four PTX versions in the fat binary. You can also compile SASS for selected GPUs at the same time, which has the advantage of avoiding the JIT compile time for your users but also grows your binary size; an example of that is sketched below.
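For instance, a hedged sketch of such a combined build (the architecture list is illustrative; pick the ones matching the GPUs you actually care about) adds matching SASS targets alongside the PTX:

nvcc <general options> <filename.cu> \
-gencode arch=compute_20,code=sm_20 \
-gencode arch=compute_20,code=compute_20 \
-gencode arch=compute_30,code=sm_30 \
-gencode arch=compute_30,code=compute_30 \
-gencode arch=compute_35,code=sm_35 \
-gencode arch=compute_35,code=compute_35

Each code=sm_XX entry adds a prebuilt SASS image for that GPU, while the code=compute_XX entries keep the PTX fallback available for JIT compilation on other devices.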
Check out the NVCC manual for more information on this.

Related

What if compute capabilities of cuda binary files does not match compute capability of current device? [duplicate]


CUDA: How to use -arch and -code and SM vs COMPUTE

I am still not sure how to properly specify the architectures for code generation when building with nvcc. I am aware that there is machine code as well as PTX code embedded in my binary and that this can be controlled via the compiler switches -code and -arch (or a combination of both using -gencode).
Now, according to this apart from the two compiler flags there are also two ways of specifying architectures: sm_XX and compute_XX, where compute_XX refers to a virtual and sm_XX to a real architecture. The flag -arch only takes identifiers for virtual architectures (such as compute_XX) whereas the -code flag takes both, identifiers for real and for virtual architectures.
The documentation states that -arch specifies the virtual architectures for which the input files are compiled. However, this PTX code is not automatically compiled to machine code, but this is rather a "preprocessing step".
Now, -code is supposed to specify which architectures the PTX code is assembled and optimised for.
However, it is not clear which PTX or binary code will be embedded in the binary. If I specify for example -arch=compute_30 -code=sm_52, does that mean my code will first be compiled to feature level 3.0 PTX from which afterwards machine code for feature level 5.2 will be created? And what will be embedded?
If I just specify -code=sm_52 what will happen then? Only machine code for V5.2 will be embedded that has been created out of V5.2 PTX code? And what would be the difference to -code=compute_52?
Some related questions/answers are here and here.
I am still not sure how to properly specify the architectures for code generation when building with nvcc.
A complete description is somewhat complicated, but there are intended to be relatively simple, easy-to-remember canonical usages. Compile for the architecture (both virtual and real) that represents the GPUs you wish to target. A fairly simple form is:
-gencode arch=compute_XX,code=sm_XX
where XX is the two-digit compute capability of the GPU you wish to target. If you wish to target multiple GPUs, simply repeat the entire sequence for each XX target. This is approximately the approach taken with the CUDA sample code projects. (If you'd like to include PTX in your executable, include an additional -gencode with the code option specifying the same PTX virtual architecture as the arch option; an example is sketched below.)
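For example, assuming a single cc 5.2 target plus PTX for forward compatibility (a sketch; myprog.cu is a placeholder file name), the compile command might look like:

nvcc -gencode arch=compute_52,code=sm_52 -gencode arch=compute_52,code=compute_52 myprog.cu -o myprog

The first -gencode embeds sm_52 SASS, the second embeds compute_52 PTX that the driver can JIT compile for newer GPUs.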
Another fairly simple form, when targeting only a single GPU, is just to use:
-arch=sm_XX
with the same description for XX. This form will include both SASS and PTX for the specified architecture.
Now, according to this apart from the two compiler flags there are also two ways of specifying architectures: sm_XX and compute_XX, where compute_XX refers to a virtual and sm_XX to a real architecture. The flag -arch only takes identifiers for virtual architectures (such as compute_XX) whereas the -code flag takes both, identifiers for real and for virtual architectures.
That is basically correct when arch and code are used as sub-switches within the -gencode switch, or if both are used together, standalone as you describe. But, for example, when -arch is used by itself (without -code), it represents another kind of "shorthand" notation, and in that case, you can pass a real architecture, for example -arch=sm_52
However, it is not clear which PTX or binary code will be embedded in the binary. If I specify for example -arch=compute_30 -code=sm_52, does that mean my code will first be compiled to feature level 3.0 PTX from which afterwards machine code for feature level 5.2 will be created? And what will be embedded?
The exact definition of what gets embedded varies depending on the form of the usage. But for this example:
-gencode arch=compute_30,code=sm_52
or for the equivalent case you identify:
-arch=compute_30 -code=sm_52
then yes, it means that:
A temporary PTX code will be generated from your source code, and it will use cc3.0 PTX.
From that PTX, the ptxas tool will generate cc5.2-compliant SASS code.
The SASS code will be embedded in your executable.
The PTX code will be discarded.
(I'm not sure why you would actually specify such a combo, but it is legal.)
If I just specify -code=sm_52 what will happen then? Only machine code for V5.2 will be embedded that has been created out of V5.2 PTX code? And what would be the difference to -code=compute_52?
-code=sm_52 will generate cc5.2 SASS code out of an intermediate PTX code. The SASS code will be embedded, the PTX will be discarded. Note that specifying this option by itself in this form, with no -arch option, would be illegal. (1)
-code=compute_52 will generate cc5.x PTX code (only) and embed that PTX in the executable/binary. Note that specifying this option by itself in this form, with no -arch option, would be illegal. (1)
The cuobjdump tool can be used to identify what components exactly are in a given binary.
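For example (a sketch; a.out stands in for your actual binary), something like the following should list the embedded cubin and PTX entries and dump the PTX itself:

cuobjdump --list-elf --list-ptx a.out
cuobjdump --dump-ptx a.out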
(1) When no -gencode switch is used, and no -arch switch is used, nvcc assumes a default -arch=sm_20 is appended to your compile command (this is for CUDA 7.5, the default -arch setting may vary by CUDA version). sm_20 is a real architecture, and it is not legal to specify a real architecture on the -arch option when a -code option is also supplied.

CUDA nvcc - build with local card max compute capability

I can specify the compute capability to the CUDA nvcc compiler, and the default is 2.0: -gencode=arch=compute_20,code=\"sm_20,compute_20\".
I have two computers. One can do compute_20, the other can do compute_30. I am using Visual Studio. Is there a way to specify to nvcc to use the maximum local card capability? Otherwise, I would need to have a separate project (.vcxproj) on each computer (specifying the max compute capability manually), which isn't ideal.
Yes, you can specify multiple targets. The CUDA sample codes give examples of how to do this in a Visual Studio project. The basic idea would be to specify multiple -gencode switches (on the nvcc compile command line) via VS project settings under project...CUDA...device (this can also be specified on a source file-by-file basis). In Visual Studio, you just specify switch parameters, like:
compute_20,sm_20;compute_30,sm_30;compute_35,sm_35;
and the Visual Studio CUDA-enabled build system will convert that to a sequence of -gencode switches like:
-gencode arch=compute_20,code=sm_20 -gencode arch=compute_30,code=sm_30 ...
which the nvcc compiler will recognize and generate separate device code for the various targets specified. This is a fairly complicated subject, so you may want to read about the fatbinary system and nvcc compilation flow in the nvcc manual, or study other questions about it on the cuda tag here on SO like this one.
Anticipating some of your other questions, that are also covered in the nvcc manual:
The CUDA runtime will select the best fit for the actual device, based on the available targets in your fatbinary. If an exact SASS compiled binary exists, it will use that, otherwise it will take the closest PTX object and JIT-compile for the intended device.
The __CUDA_ARCH__ macro exists and is defined in device code. You could use it to specialize device code for various targets, and it also gives you a (somewhat tedious) way to verify that the CUDA runtime selected the object you expected; a sketch of that follows.
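A minimal sketch of such a check (the kernel name is made up for the example; device-side printf requires cc 2.0 or later, which all of the targets above satisfy):

#include <cstdio>

__global__ void report_arch()
{
#ifdef __CUDA_ARCH__
    // __CUDA_ARCH__ is e.g. 200, 300 or 350, reflecting the architecture the
    // selected SASS/PTX object was compiled for
    printf("device code built for __CUDA_ARCH__ = %d\n", __CUDA_ARCH__);
#endif
}

int main()
{
    report_arch<<<1, 1>>>();
    cudaDeviceSynchronize();   // make sure the device-side printf output is flushed
    return 0;
}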

How to detect NVIDIA CUDA Architecture [duplicate]

I've recently gotten my head around how NVCC compiles CUDA device code for different compute architectures.
From my understanding, when using NVCC's -gencode option, "arch" is the minimum compute architecture required by the programmer's application, and also the minimum device compute architecture that NVCC's JIT compiler will compile PTX code for.
I also understand that the "code" parameter of -gencode is the compute architecture which NVCC completely compiles the application for, such that no JIT compilation is necessary.
After inspection of various CUDA project Makefiles, I've noticed the following occur regularly:
-gencode arch=compute_20,code=sm_20
-gencode arch=compute_20,code=sm_21
-gencode arch=compute_21,code=sm_21
and after some reading, I found that multiple device architectures could be compiled for in a single binary file - in this case sm_20, sm_21.
My questions are: why are so many arch/code pairs necessary? Are all of the "arch" values in the above actually used?
What is the difference between that and, say:
-arch compute_20
-code sm_20
-code sm_21
Is the earliest virtual architecture in the "arch" fields selected automatically, or is there some other obscure behaviour?
Is there any other compilation and runtime behaviour I should be aware of?
I've read the manual, http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-compilation and I'm still not clear regarding what happens at compilation or runtime.
Roughly speaking, the code compilation flow goes like this:
CUDA C/C++ device code source --> PTX --> SASS
The virtual architecture (e.g. compute_20, whatever is specified by -arch compute...) determines what type of PTX code will be generated. The additional switches (e.g. -code sm_21) determine what type of SASS code will be generated. SASS is actually executable object code for a GPU (machine language). An executable can contain multiple versions of SASS and/or PTX, and there is a runtime loader mechanism that will pick appropriate versions based on the GPU actually being used.
As you point out, one of the handy features of GPU operation is JIT-compile. JIT-compile will be done by the GPU driver (does not require the CUDA toolkit to be installed) anytime a suitable PTX code is available but a suitable SASS code is not. The definition of a "suitable PTX" code is one which is numerically equal to or lower than the GPU architecture being targeted for running the code. To pick an example, specifying arch=compute_30,code=compute_30 would tell nvcc to embed cc3.0 PTX code in the executable. This PTX code could be used to generate SASS code for any future architecture that the GPU driver supports. Currently this would include architectures like Pascal, Volta, Turing, etc. assuming the GPU driver supports those architectures.
One advantage of including multiple virtual architectures (i.e. multiple versions of PTX), then, is that you have executable compatibility with a wider variety of target GPU devices (although some devices may trigger a JIT-compile to create the necessary SASS).
One advantage of including multiple "real GPU targets" (i.e. multiple SASS versions) is that you can avoid the JIT-compile step, when one of those target devices is present.
If you specify a bad set of options, it's possible to create an executable that won't run (correctly) on a particular GPU.
One possible disadvantage of specifying a lot of these options is code size bloat. Another possible disadvantage is compile time, which will generally be longer as you specify more options.
It's also possible to create executables that contain no PTX, which may be of interest to those trying to obscure their IP.
Creating PTX suitable for JIT should be done by specifying a virtual architecture for the code switch.
The purpose of multiple -arch flags is to use the __CUDA_ARCH__ macro for conditional compilation (i.e., using #if/#ifdef) of differently-optimized code paths.
See here: http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#virtual-architecture-identification-macro
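A short sketch of what such a specialization can look like (the function, the threshold, and the fallback are illustrative only, not taken from the answer above):

__device__ int warp_sum(int v)
{
#if __CUDA_ARCH__ >= 300
    // cc 3.0 and newer: warp shuffle instructions are available
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down(v, offset);
    return v;
#else
    // older architectures: no shuffle instructions; a shared-memory reduction
    // would go here instead (omitted in this sketch)
    return v;
#endif
}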

Does 'code=sm_X' embed only binary (cubin) code, or also PTX code, or both?

I am a little bit confused about the 'code=sm_X' option within the '-gencode' statement.
An example: What does the NVCC compiler option
-gencode arch=compute_13,code=sm_13
embed in the library?
Only the machine code (cubin code) for GPUs with CC 1.3, or also the PTX code for GPUs with CC 1.3 ?
In the 'Maxwell compatibility guide', it is stated that "Only the back-end target version(s) specified by the 'code=' clause will be retained in the resulting binary".
From that, I would infer that the given compiler option only embeds machine code for GPUs with CC 1.3 and no PTX code. This would mean that it would not be possible to run this library e.g. on a Maxwell generation card, as there is no PTX code embedded within the library from which the machine code could be 'just-in-time' (JIT) compiled.
On the other hand, in the GTC 2013 presentation 'Introduction to the CUDA Toolkit as an Application Build Tool' by NVIDIA it is stated that '-gencode arch=compute_13,code=sm_13' is enough for all GPUs with CC >= 1.3, and that with this compiler option the machine code for GPUs with CC > 1.3 is JIT-ed from the PTX code. So the information given in the Maxwell compatibility guide and this GTC presentation is conflicting, in my opinion.
nvcc has many formats by which the code generation options can be specified. A read of section 6 of the nvcc manual may be instructive.
When using this format:
nvcc -gencode arch=compute_13,code=sm_13 ...
only the SASS code for an sm_13 (cc 1.3) device will be retained. There will be no PTX retained in the executable object, and so the code can only run on a device capable of running cc 1.3 SASS.
Using the above command format, in order to embed a PTX version of the source code into the executable object, it's necessary to use a virtual architecture specification for the option provided to code=.... Since this particular format (using -gencode) does not allow specification of multiple targets in a single switch, we must pass the -gencode switch multiple times to nvcc, one for each target we desire to be embedded in the executable object.
So extending the above example, we could use the following:
nvcc -gencode arch=compute_13,code=sm_13 -gencode arch=compute_13,code=compute_13 ...
This would embed both cc1.3 SASS (by the first gencode switch) and cc1.3 PTX (by the second gencode switch) in the executable. Devices capable of running cc1.3 SASS code directly will use that. Other devices (of compute capability greater than cc 1.3) will do a JIT-compile step by the driver, to convert the cc1.3 PTX code to a SASS code with an architecture suitable for the device in question.
I agree that the GTC 2013 presentation (e.g. slide 37) seems to suggest that
nvcc -gencode arch=compute_13,code=sm_13 ...
is sufficient for all devices of compute capability 1.3 or higher. It is not, and this is easy to demonstrate. If you compile a code using the above format, and attempt to run it on a cc 2.0 device, it will fail with an "invalid device function" error associated with any kernel or kernels you have in your code.
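If you want to catch that situation programmatically, a minimal sketch (the kernel is a made-up placeholder) is to check the error status right after a kernel launch:

#include <cstdio>

__global__ void probe() {}   // stands in for any kernel in your code

int main()
{
    probe<<<1, 1>>>();
    cudaError_t err = cudaGetLastError();   // reports launch failures such as "invalid device function"
    if (err != cudaSuccess)
        printf("kernel launch failed: %s\n", cudaGetErrorString(err));
    else
        printf("kernel launch succeeded\n");
    cudaDeviceSynchronize();
    return 0;
}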
Again, nvcc has a variety of command formats and "shortcuts" for specifying code generation. Some relatively simple ones, such as:
nvcc -arch=sm_13 ...
will embed both a PTX and SASS version of the code in the executable object, resulting in the kind of forward-compatibility suggested.
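If I remember the shorthand expansion correctly, that last form behaves roughly like the explicit pair shown earlier:

nvcc -gencode arch=compute_13,code=sm_13 -gencode arch=compute_13,code=compute_13 ...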