nvcc -Xptxas -v compiler flag has no effect - cuda

I have a CUDA project. It consists of several .cpp files that contain my application logic and one .cu file that contains multiple kernels plus a __host__ function that invokes them.
Now I would like to determine the number of registers used by my kernel(s). My normal compiler call looks like this:
nvcc -arch compute_20 -link src/kernel.cu obj/..obj obj/..obj .. -o bin/..exe -l glew32 ...
Adding the "-Xptxas -v" compiler flag to this call unfortunately has no effect. The compiler still produces the same textual output as before. The compiled .exe also works the same way as before, with one exception: my framerate jumps to 1800fps, up from 80fps.

I had the same problem, here is my solution:
Compile the .cu files into device-only .ptx files; this discards the host code:
nvcc -ptx *.cu
Compile the .ptx file:
ptxas -v *.ptx
The second step will show you the number of registers used by each kernel and the amount of shared memory used.
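For example, with the kernel.cu from the question (the explicit -arch on ptxas is my addition; it keeps the register report tied to the cc2.0 target from the original build):
nvcc -ptx src/kernel.cu -o kernel.ptx
ptxas -arch=sm_20 -v kernel.ptx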

Convert the compute_20 to sm_20 in your compiler call. That should fix it.
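Applied to the call from the question (with the verbose flag included; the object files and libraries stay elided as in the original), that would be:
nvcc -arch sm_20 -Xptxas -v -link src/kernel.cu obj/..obj obj/..obj .. -o bin/..exe -l glew32 ...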

When using "-Xptxas -v" together with "-arch", we cannot get the verbose information (register count, etc.), apparently because with a purely virtual target (compute_XX) ptxas is not invoked at compile time, so there is nothing to report. If we want to see the verbose output without losing the chance of assigning the GPU architecture (-arch, -code) ahead of time, we can do the following steps:
nvcc -arch compute_XX *.cu -keep
ptxas -v *.ptx
But we will obtain many intermediate files. Certainly, kogut's answer is to the point.

When you compile, pass the verbose option in its long form:
nvcc --ptxas-options=-v

You may want to check your compiler's default verbosity settings.
For example, in Visual Studio go to:
Tools -> Options -> Projects and Solutions -> Build and Run
then set the build output verbosity to Normal.

Not exactly what you were looking for, but you can use the CUDA Visual Profiler shipped with the NVIDIA GPU Computing SDK. Besides much other useful information, it shows the number of registers used by each kernel in your application.

Related

What if compute capabilities of cuda binary files does not match compute capability of current device? [duplicate]

This question already has an answer here: CUDA: How to use -arch and -code and SM vs COMPUTE (see below).

How to remove all PTX from compiled CUDA to prevent Intellectual Property leaks

CUDA PTX is analogous to assembly, and as such reveals the source code. I have read Section 3.1 of the CUDA Programming Guide and Section 3.2.7 from the online CUDA compiler documentation. I have a basic understanding of the -arch versus -code compiler options.
If I understand correctly, specifying -arch compute_XX makes PTX, whereas -code sm_XX makes both PTX and cubin.
I desire only cubin, such that no PTX is in the resulting image. How can I achieve this?
Preferably via Visual Studio settings, although I only find the -gencode option within Visual Studio Project Settings.
PTX is not quite analogous to assembly. PTX is an intermediate representation of the program that can be compiled to the different, incompatible instruction set architectures (ISAs) that Nvidia GPUs have been using over time. Usually, a new ISA for Nvidia GPUs comes with an updated version of PTX that can represent the new features of the ISA.
The -arch and -code options to nvcc work slightly differently to what you describe. They are not (mutually exclusive) alternatives; rather, they determine different aspects.
-arch controls which PTX version is used as the intermediate representation. As such it is combined with a compute_XX PTX version.
-code controls what code is embedded into the resulting binary - either machine code for the specified ISA if used in the -code sm_XX form, or PTX to be just-in-time compiled by the GPU driver if -code compute_XX is specified.
As a special shortcut, specifying only -arch sm_XX will embed both the compiled code for the specified ISA and PTX code into the binary - this is probably the situation you are referring to and want to avoid.
Finally the -gencode option allows you to specify multiple -arch/-code pairs, with the resulting binary containing separate code for each of the pairs.
You can use nvprune to remove all but the desired ISA code from a binary.
If unsure, you can always use cuobjdump to check what is in a specific binary.
So the way to prevent any PTX code from being present in your resulting binary is to call nvcc as nvcc -arch compute_XX -code sm_XX (or use multiple such pairs together with -gencode).
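Putting this together, a minimal sketch (app.cu and the cc5.2 target are illustrative, not from the question):
nvcc -arch compute_52 -code sm_52 app.cu -o app
cuobjdump --list-ptx app
The cuobjdump listing should report no PTX images, confirming that only SASS was embedded.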

CUDA: How to use -arch and -code and SM vs COMPUTE

I am still not sure how to properly specify the architectures for code generation when building with nvcc. I am aware that there is machine code as well as PTX code embedded in my binary and that this can be controlled via the compiler switches -code and -arch (or a combination of both using -gencode).
Now, according to this, apart from the two compiler flags there are also two ways of specifying architectures: sm_XX and compute_XX, where compute_XX refers to a virtual and sm_XX to a real architecture. The flag -arch only takes identifiers for virtual architectures (such as compute_XX), whereas the -code flag takes both: identifiers for real and for virtual architectures.
The documentation states that -arch specifies the virtual architectures for which the input files are compiled. However, this PTX code is not automatically compiled to machine code, but this is rather a "preprocessing step".
Now, -code is supposed to specify which architectures the PTX code is assembled and optimised for.
However, it is not clear which PTX or binary code will be embedded in the binary. If I specify for example -arch=compute_30 -code=sm_52, does that mean my code will first be compiled to feature level 3.0 PTX from which afterwards machine code for feature level 5.2 will be created? And what will be embedded?
If I just specify -code=sm_52 what will happen then? Only machine code for V5.2 will be embedded that has been created out of V5.2 PTX code? And what would be the difference to -code=compute_52?
Some related questions/answers are here and here.
I am still not sure how to properly specify the architectures for code generation when building with nvcc.
A complete description is somewhat complicated, but there are intended to be relatively simple, easy-to-remember canonical usages. Compile for the architectures (both virtual and real) that represent the GPUs you wish to target. A fairly simple form is:
-gencode arch=compute_XX,code=sm_XX
where XX is the two-digit compute capability for the GPU you wish to target. If you wish to target multiple GPUs, simply repeat the entire sequence for each XX target. This is approximately the approach taken with the CUDA sample code projects. (If you'd like to include PTX in your executable, include an additional -gencode with the code option specifying the same PTX virtual architecture as the arch option, as in the example below.)
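As a concrete sketch, to build SASS for cc3.5 and cc5.2 GPUs plus cc5.2 PTX for JIT compilation on future devices (the compute capabilities here are purely illustrative):
nvcc x.cu \
-gencode arch=compute_35,code=sm_35 \
-gencode arch=compute_52,code=sm_52 \
-gencode arch=compute_52,code=compute_52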
Another fairly simple form, when targeting only a single GPU, is just to use:
-arch=sm_XX
with the same description for XX. This form will include both SASS and PTX for the specified architecture.
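As a point of reference, and using 52 purely as an example value, this shorthand expands to approximately:
-gencode arch=compute_52,code=sm_52 -gencode arch=compute_52,code=compute_52
i.e. cc5.2 SASS plus cc5.2 PTX in the same fat binary.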
Now, according to this, apart from the two compiler flags there are also two ways of specifying architectures: sm_XX and compute_XX, where compute_XX refers to a virtual and sm_XX to a real architecture. The flag -arch only takes identifiers for virtual architectures (such as compute_XX), whereas the -code flag takes both: identifiers for real and for virtual architectures.
That is basically correct when arch and code are used as sub-switches within the -gencode switch, or if both are used together, standalone as you describe. But, for example, when -arch is used by itself (without -code), it represents another kind of "shorthand" notation, and in that case, you can pass a real architecture, for example -arch=sm_52
However, it is not clear which PTX or binary code will be embedded in the binary. If I specify for example -arch=compute_30 -code=sm_52, does that mean my code will first be compiled to feature level 3.0 PTX from which afterwards machine code for feature level 5.2 will be created? And what will be embedded?
The exact definition of what gets embedded varies depending on the form of the usage. But for this example:
-gencode arch=compute_30,code=sm_52
or for the equivalent case you identify:
-arch=compute_30 -code=sm_52
then yes, it means that:
1. A temporary PTX code will be generated from your source code, and it will use cc3.0 PTX.
2. From that PTX, the ptxas tool will generate cc5.2-compliant SASS code.
3. The SASS code will be embedded in your executable.
4. The PTX code will be discarded.
(I'm not sure why you would actually specify such a combo, but it is legal.)
If I just specify -code=sm_52 what will happen then? Only machine code for V5.2 will be embedded that has been created out of V5.2 PTX code? And what would be the difference to -code=compute_52?
-code=sm_52 will generate cc5.2 SASS code out of an intermediate PTX code. The SASS code will be embedded, the PTX will be discarded. Note that specifying this option by itself in this form, with no -arch option, would be illegal. (1)
-code=compute_52 will generate cc5.2 PTX code (only) and embed that PTX in the executable/binary. Note that specifying this option by itself in this form, with no -arch option, would be illegal. (1)
The cuobjdump tool can be used to identify what components exactly are in a given binary.
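For example (a.out stands in for whatever binary you built):
cuobjdump --list-elf a.out
cuobjdump --list-ptx a.out
The first command lists the embedded SASS (cubin) images, the second the embedded PTX images.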
(1) When no -gencode switch is used, and no -arch switch is used, nvcc assumes a default -arch=sm_20 is appended to your compile command (this is for CUDA 7.5, the default -arch setting may vary by CUDA version). sm_20 is a real architecture, and it is not legal to specify a real architecture on the -arch option when a -code option is also supplied.

NVCC compilation options for generating the best code (using JIT)

I am trying to understand nvcc compilation phases but I am a little bit confused. Because I don't know the exact hardware configuration of the machine that will run my software, I want to use JIT compilation feature in order to generate the best possible code for it. In the NVCC documentation I found this:
"For instance, the command below allows generation of exactly matching GPU binary code, when the application is launched on an sm_10, an sm_13, and even a later architecture:"
nvcc x.cu -arch=compute_10 -code=compute_10
So my understanding is that the above options will produce the best/fastest/optimum code for the current GPU. Is that correct? I also read that the default nvcc options are:
nvcc x.cu -arch=compute_10 -code=sm_10,compute_10
If the above is indeed correct, why can't I use any compute_20 features in my application?
When you specify a target architecture you are restricting yourself to the features available in that architecture. That's because the PTX code is a virtual assembly code, so you need to know the features available during PTX generation. The PTX will be JIT compiled to the GPU binary code (SASS) for whatever GPU you are running on, but it can't target newer architecture features.
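For instance, device-side printf is a compute capability 2.0 feature. A minimal sketch of my own (not from the question) that compiles with -arch=compute_20 but is rejected when generating compute_10 PTX:
#include <cstdio>

__global__ void hello()
{
    // Device-side printf requires compute capability >= 2.0,
    // so this kernel cannot be lowered to compute_10 PTX.
    printf("hello from thread %d\n", threadIdx.x);
}

int main()
{
    hello<<<1, 1>>>();
    cudaDeviceSynchronize(); // flush the device-side printf buffer
    return 0;
}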
I suggest that you pick a minimum architecture (for example, 1.3 if you want double precision or 2.0 if you want a Fermi-or-later feature) and then create PTX for that architecture AND newer base architectures. You can do this in one command (although it will take longer since it requires multiple passes through the code) and bundle everything into a single fat binary.
An example command line may be:
nvcc <general options> <filename.cu> \
-gencode arch=compute_13,code=compute_13 \
-gencode arch=compute_20,code=compute_20 \
-gencode arch=compute_30,code=compute_30 \
-gencode arch=compute_35,code=compute_35
That will create four PTX versions in the binary. You could also compile SASS for selected GPUs at the same time, which has the advantage of avoiding the JIT compile time for your users, but also grows your binary size.
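For example, to embed SASS for cc2.0 and cc3.0 devices alongside their PTX fallbacks (an illustrative variant of the command above, trimmed to two architectures):
nvcc <general options> <filename.cu> \
-gencode arch=compute_20,code=sm_20 \
-gencode arch=compute_20,code=compute_20 \
-gencode arch=compute_30,code=sm_30 \
-gencode arch=compute_30,code=compute_30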
Check out the NVCC manual for more information on this.

CUDA code too large to be expanded [duplicate]

I have a CUDA class, let's call it A, defined in a header file. I have written a test kernel which creates an instance of class A, which compiles fine and produces the expected result.
In addition, I have my main CUDA kernel, which also compiles fine and produces the expected result. However, when I add code to my main kernel to instantiate an instance of class A, the nvcc compiler fails with a segmentation fault.
Update:
To clarify, the segmentation fault happens during compilation, not when running the kernel. The line I am using to compile is:
`nvcc --cubin -arch compute_20 -code sm_20 -I<My include dir> --keep kernel.cu`
where <My include dir> is the path to my local directory containing some utility header files.
My question is, before spending a lot of time isolating a minimal example exhibiting the behaviour (not trivial, due to relatively large code base), has anyone encountered a similar issue? Would it be possible for the nvcc compiler to fail and die if the kernel is either too long or uses too many registers?
If an issue such as register count can affect the compiler this way, then I will need to rethink how to implement my kernel to use fewer resources. This would also mean that trimming things down to a minimal example will likely make the problem disappear. However, if this is not even a possibility, I don't want to waste time on a dead-end, but will rather try to cut things down to a minimal example and will file a bug report to NVIDIA.
Update:
As per the suggestion of @njuffa, I reran the compilation with the -v flag enabled. The output ends with the following:
#$ ptxas -arch=sm_20 -m64 -v "/path/to/kernel_ptx/kernel.ptx" -o "kernel.cubin"
Segmentation fault
# --error 0x8b --
This suggests the problem is due to the ptxas program, which is failing to generate a CUDA binary from the ptx file.
This would appear to have been a genuine bug of some sort in the CUDA 5.0 ptxas assembler. It was reported to NVIDIA and we can assume that it was fixed sometime during the more than three years since the question was asked and this answer added.
[This answer has been assembled from comments and added as a community wiki entry to get this question off the unanswered questions list.]