Setting 32 bit address size in inline PTX - cuda

I'm in the process of converting PTX written as a separate file to inline PTX. In the separate PTX file, I was defining the ISA and target as follows:
.version 1.2
.target sm_13
In the PTX file generated by the compiler, after having inlined the PTX, the compiler has specified ISA and target as follows:
.version 3.0
.target sm_20
.address_size 64
The .address_size 64 is problematic for me because it means that I would have to update the pointer arithmetic that I do in the inline PTX from 32 bit to 64 bit.
Given that 32 bits can address 4GB, more memory than my card has, is it possible to make the compiler specify a 32 bit address size, so that I don't have to update the pointer arithmetic?
Are 32 bit addresses supported on sm_20, given the new unified addressing system?

The 64-bit version of the nvcc compiler produces 64-bit PTX by default. If you pass -m32 to nvcc as a command line option, it will generate 32-bit pointers instead. The option is covered in the NVCC documentation:
http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#options-for-guiding-compiler-driver
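For reference, here is a minimal sketch of how the pointer operand in inline PTX changes with the machine model (the function name load_global is hypothetical): the "l" constraint binds the pointer to a 64-bit register operand, while "r" binds it to a 32-bit one, so the same load can be written for either address size.

// Minimal sketch: the same global load written for 64-bit and 32-bit address sizes.
__device__ int load_global(const int *p)
{
    int v;
#if defined(__LP64__) || defined(_WIN64)
    // 64-bit machine model (nvcc -m64, the default on a 64-bit toolchain)
    asm volatile("ld.global.s32 %0, [%1];" : "=r"(v) : "l"(p));
#else
    // 32-bit machine model (nvcc -m32): pointers fit in 32-bit registers
    asm volatile("ld.global.s32 %0, [%1];" : "=r"(v) : "r"(p));
#endif
    return v;
}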

Related

How to remove all PTX from compiled CUDA to prevent Intellectual Property leaks

CUDA PTX is analogous to assembly, and as such reveals the source code. I have read Section 3.1 of the CUDA Programming Guide and Section 3.2.7 from the online CUDA compiler documentation. I have a basic understanding of the -arch versus -code compiler options.
If I understand correctly, specifying -arch compute_XX produces PTX, whereas -code sm_XX produces both PTX and cubin.
I desire only cubin, such that no PTX is in the resulting image. How can I achieve this?
Preferably via Visual Studio settings, although I only find the -gencode option within Visual Studio Project Settings.
PTX is not quite analogous to assembly. PTX is an intermediate representation of the program that can be compiled to the different, incompatible instruction set architectures (ISA) that Nvidia GPUs have been using over time. Usually, a new ISA for Nvidia GPUs comes with an updated version of PTX that can represent the new features of the ISA.
The -arch and -code options to nvcc work slightly differently from what you describe. They are not mutually exclusive alternatives; rather, they determine different aspects.
-arch controls which PTX version is used as the intermediate representation. As such it is combined with a compute_XX PTX version.
-code controls what code is embedded into the resulting binary - either machine code for the specified ISA if used in the -code sm_XX form, or PTX to be just-in-time compiled by the GPU driver if -code compute_XX is specified.
As a special shortcut, specifying only -arch sm_XX will embed both the compiled code for the specified ISA and PTX code into the binary - this is probably the situation you are referring to and want to avoid.
Finally the -gencode option allows you to specify multiple -arch/-code pairs, with the resulting binary containing separate code for each of the pairs.
You can use nvprune to remove all but the desired ISA code from a binary.
If unsure, you can always use cuobjdump to check what is in a specific binary.
So the way to prevent any PTX code from being present in your resulting binary is to call nvcc as nvcc -arch compute_XX -code sm_XX (or use multiple such pairs together with -gencode).
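As a concrete sketch (the compute capability 5.2 used here is just an example, and app.cu is a hypothetical source file), the invocations and a quick way to verify the result might look like this:

nvcc -arch compute_52 -code sm_52 -o app app.cu            # embeds only sm_52 machine code, no PTX
nvcc -gencode arch=compute_52,code=sm_52 -gencode arch=compute_70,code=sm_70 -o app app.cu   # several ISAs, still no PTX
cuobjdump --list-ptx app                                    # should report no PTX entries
cuobjdump --list-elf app                                    # lists the embedded cubin (SASS) images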

Determining shared memory usage in CUDA Fortran

I've been writing some basic CUDA Fortran code. I would like to be able to determine the amount of shared memory my program uses per thread block (for occupancy calculation). I have been compiling with -Mcuda=ptxinfo in the hope of finding this information. The compilation output ends with
ptxas info : Function properties for device_procedures_main_kernel_
432 bytes stack frame, 1128 bytes spill stores, 604 bytes spill loads
ptxas info : Used 63 registers, 96 bytes smem, 320 bytes cmem[0]
which is the only place in the output that smem is mentioned. There is one array in the global subroutine main_kernel with the shared attribute. If I remove the shared attribute then I get
ptxas info : Function properties for device_procedures_main_kernel_
432 bytes stack frame, 1124 bytes spill stores, 532 bytes spill loads
ptxas info : Used 63 registers, 320 bytes cmem[0]
The smem has disappeared. It seems that only shared memory in main_kernel is being counted: device subroutines in my code use variables with the shared attribute, but these don't appear to be mentioned in the output. For example, the device subroutine evalfuncs includes shared variable declarations, but the relevant output is
ptxas info : Function properties for device_procedures_evalfuncs_
504 bytes stack frame, 1140 bytes spill stores, 508 bytes spill loads
Do all variables with the shared attribute need to be declared in a global subroutine?
Do all variables with the shared attribute need to be declared in a global subroutine?
No.
You haven't shown example code or your compile command, nor have you identified the version of the PGI compiler tools you are using. However, the most likely explanation I can think of for what you are seeing is that as of PGI 14.x, the default CUDA compile option is to generate relocatable device code. This is documented in section 2.2.3 of the current PGI release notes:
2.2.3. Relocatable Device Code
An rdc option is available for the -ta=tesla and -Mcuda flags that specifies to generate relocatable device code. Starting in PGI 14.1 on Linux and in PGI 14.2 on Windows, the default code generation and linking mode for Tesla-target OpenACC and CUDA Fortran is rdc, relocatable device code.
You can disable the default and enable the old behavior and non-relocatable code by specifying any of the following: -ta=tesla:nordc, -Mcuda=nordc, or by specifying any 1.x compute capability or any Radeon target.
So the specific option to enable (or disable) this is:
-Mcuda=(no)rdc
(note that -Mcuda=rdc is the default if you don't specify this option)
CUDA Fortran separates Fortran host code from device code. For the device code, the CUDA Fortran compiler does a CUDA Fortran->CUDA C conversion, and passes the auto-generated CUDA C code to the CUDA C compiler. Therefore, the behavior and expectations of switches like rdc and ptxinfo are derived from the behavior of the underlying equivalent CUDA compiler options (-rdc=true and -Xptxas -v, respectively).
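For reference, a rough sketch of the equivalent CUDA C command lines (the file name kernel.cu is hypothetical):

nvcc -rdc=true  -Xptxas -v kernel.cu    # relocatable device code, the analogue of -Mcuda=rdc (the PGI 14.x default)
nvcc -rdc=false -Xptxas -v kernel.cu    # whole-program compilation, the analogue of -Mcuda=nordc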
When CUDA device code is compiled without the rdc option, the compiler will normally try to inline device (sub)routines that are called from a kernel, into the main kernel code. Therefore, when the compiler is generating the ptxinfo, it can determine all resource requirements (e.g. shared memory, registers, etc.) when it is compiling (ptx assembly) the kernel code.
When the rdc option is specified, however, the compiler may (depending on some other switches and function attributes) leave the device subroutines as separately callable routines with their own entry point (i.e. not inlined). In that scenario, when the device compiler is compiling the kernel code, the call to the device subroutine just looks like a call instruction, and the compiler (at that point) has no visibility into the resource usage requirements of the device subroutine. This does not mean that there is an underlying flaw in the compile sequence. It simply means that the ptxinfo mechanism cannot accurately roll up the resource requirements of the kernel and all of its called subroutines at that point in time.
The ptxinfo output also does not declare the total amount of shared memory used by a device subroutine, when it is compiling that subroutine, in rdc mode.
If you turn off the rdc mode:
-Mcuda=nordc
I believe you will see an accurate accounting of the shared memory used by a kernel plus all of its called subroutines, given a few caveats: one is that the compiler is able to successfully inline your called subroutines (pretty likely, and the accounting should still work even if it can't); another is that you are working with a kernel plus all of its called subroutines in the same file (i.e. translation unit). If you have kernels that call device subroutines in different translation units, then the rdc option is the only way to make it work.
Shared memory will still be appropriately allocated for your code at runtime, regardless (assuming you have not violated the total amount of shared memory available). You can also get an accurate reading of the shared memory used by a kernel by profiling your code, using a profiler such as nvvp or nvprof.
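As a hedged example of the profiler route (the program name ./myprogram is hypothetical), nvprof can report the shared memory actually allocated for each kernel launch:

nvprof --print-gpu-trace ./myprogram    # the SSMem/DSMem columns show static and dynamic shared memory per launch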
If this explanation doesn't describe what you are seeing, I would suggest providing a complete sample code, as well as the exact compile command you are using, plus the version of PGI tools you are using. (I think it's a good suggestion for future questions as well.)

CUDA/PTX 32-bit vs. 64-bit

CUDA compilers have options for producing 32-bit or 64-bit PTX. What is the difference between these? Is it like x86, where NVIDIA GPUs actually have 32-bit and 64-bit ISAs? Or is it related to host code only?
Pointers are certainly the most obvious difference. 64 bit machine model enables 64-bit pointers. 64 bit pointers enable a variety of things, such as address spaces larger than 4GB, and unified virtual addressing. Unified virtual addressing in turn enables other things, such as GPUDirect Peer-to-Peer. The CUDA IPC API also depends on 64 bit machine model.
The x64 ISA is not completely different from the x86 ISA; it's mostly an extension of it. Those familiar with the x86 ISA will find the x64 ISA familiar, with natural extensions for 64 bits where needed. Likewise, 64 bit machine model is an extension of the capabilities of the PTX ISA to 64 bits. Most PTX instructions work exactly the same way.
32 bit machine model can handle 64 bit data types (such as double and long long), so frequently there don't need to be any changes to properly written CUDA C/C++ source code to compile for 32 bit machine model or 64 bit machine model. If you program directly in PTX, you may have to account for the pointer size differences, at least.
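One way to see the difference for yourself (the file name check.cu is hypothetical) is to emit PTX for both machine models and compare the headers:

nvcc -m64 -ptx check.cu -o check64.ptx    # generated PTX contains .address_size 64; pointers are 8 bytes
nvcc -m32 -ptx check.cu -o check32.ptx    # generated PTX contains .address_size 32; pointers are 4 bytes (requires a 32-bit host toolchain)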

Compiling CUDA program for a GeForce 310 (compute capability 1.2) with unmatched options "-arch=compute_20 -code=sm_20"

I'm compiling a CUDA program using nvcc with the options -arch=compute_20 -code=sm_20 for a GeForce 310 GPU with compute capability 1.2. The program seems to run normally, as follows.
wangli@wangli-desktop:~/wangliC2050/1D-EncodeV6.1$ make
nvcc -O --ptxas-options=-v 1D-EncodeV6.1.cu -o 1D-EncodeV6.1 -I../../NVIDIA_GPU_Computing_SDK/C/common/inc -I../../NVIDIA_GPU_Computing_SDK/shared/inc -arch=compute_20 -code=sm_20
ptxas info : Compiling entry function '_Z6EncodePhPjS0_S_S_' for 'sm_20'
ptxas info : Function properties for _Z6EncodePhPjS0_S_S_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 14 registers, 52 bytes cmem[0]
wangli@wangli-desktop:~/wangliC2050/1D-EncodeV6.1$ ./1D-EncodeV6.1
########################### Encoding start (loopCount=10)#######################
#p n size averageTime(s) averageThroughput(MB/s) errorRate(0~1)
#================= Encode on GPU v6.1 ===============
4 4 4 0.000294 0.051837 100.000000
#################### Encoding stop #########################
So, I wonder:
Why could this program run on a GeForce 310 with the nvcc options -arch=compute_20 -code=sm_20, which do not match the compute capability 1.2 of the card?
What will happen if the value of the -arch option differs from that of the -code option?
Thanks.
A CUDA executable typically contains 2 types of program data: SASS code, which is basically GPU machine code, and PTX, which is an intermediate code (although it's pretty close to machine code). As long as PTX code is present in the executable, if the driver decides that a proper SASS binary is not available for the GPU that the code will actually run on, it will do a "JIT-compile" step at application launch, to create the necessary binary code appropriate for the device in question, using the PTX code in the application package.
This is what is happening in your case.
If arch != code, then you're creating device code that architecturally conforms to the arch type, but is compiled to use machine level instructions that are associated with the code type. For example, if I compile for arch = 1.2 and code = 2.0, I cannot use double types (they will be demoted to float, because double is not supported in a 1.2 architecture) but the SASS machine code generated will be ready to execute on a cc 2.0 device, and will not require a JIT-compile step for that kind of device.
The NVCC manual has more information particularly the section on steering code generation.
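As a sketch of the combinations discussed above (these command lines are illustrative only, and app.cu is a hypothetical source file):

nvcc -gencode arch=compute_12,code=sm_12 app.cu        # SASS for cc 1.2: runs natively on a GeForce 310, no JIT needed
nvcc -gencode arch=compute_12,code=compute_12 app.cu   # PTX only: JIT-compiled by the driver on any device of cc 1.2 or higher
nvcc -gencode arch=compute_12,code=sm_20 app.cu        # cc 1.2 feature set (no doubles), but SASS for cc 2.0 devices only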

CUDA PTX code %envreg<32> special registers

I tried to run a PTX assembly code generated by a .cl kernel with the CUDA driver API. The steps I took were these (standard OpenCL procedure):
1) Load the .cl kernel
2) JIT compile it
3) Get the compiled PTX code and save it.
So far so good.
I noticed some special registers inside the PTX assembly, %envreg3, %envreg6 etc. The problem is that these registers are not set (according to the PTX ISA, these registers are set by the driver before the kernel launch) when I try to execute the code with the driver API. So the code falls into an infinite loop and fails to run correctly. But if I manually set the values (more exactly, I replace %envreg6 with the block size inside the PTX), the code executes and I get the correct results (correct compared with the CPU results).
Does anyone know how we can set values for these registers, or maybe if I am missing something? i.e. a flag on cuLaunchKernel that sets values for these registers?
You are trying to compile an OpenCL kernel and run it using the CUDA driver API. The NVIDIA driver/compiler interface is different between OpenCL and CUDA, so what you want to do is not supported and fundamentally cannot work.
Presumably, the only workaround would be the one you found: to patch the PTX code. But I'm afraid this might not work in the general case.
Edit:
Specifically, OpenCL supports larger grids than most NVIDIA GPUs support, so grid sizes need to be virtualized by dividing across multiple actual grid launches, and so offsets are necessary. Also in OpenCL, indices do not necessarily start from (0, 0, 0), the user can specify offsets which the driver must pass to the kernel. Therefore the registers initialized for OpenCL and CUDA C launches are different.
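For comparison, here is a minimal sketch of the supported path, where the PTX comes from CUDA C rather than OpenCL (the file and kernel names are hypothetical, and error checking is omitted):

// kernel.ptx is assumed to come from: nvcc -ptx kernel.cu, with the kernel declared
// extern "C" __global__ void mykernel(void) so that its name is not mangled in the PTX.
#include <cuda.h>

int main(void)
{
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);
    CUmodule mod;
    cuModuleLoad(&mod, "kernel.ptx");              // the driver JIT-compiles the PTX here
    CUfunction fn;
    cuModuleGetFunction(&fn, mod, "mykernel");
    cuLaunchKernel(fn, 1, 1, 1, 256, 1, 1, 0, 0, NULL, NULL);   // 1 block of 256 threads, no kernel parameters
    cuCtxSynchronize();
    cuCtxDestroy(ctx);
    return 0;
}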