Determining shared memory usage in CUDA Fortran - cuda

I've been writing some basic CUDA Fortran code. I would like to be able to determine the amount of shared memory my program uses per thread block (for occupancy calculation). I have been compiling with -Mcuda=ptxinfo in the hope of finding this information. The compilation output ends with
ptxas info : Function properties for device_procedures_main_kernel_
432 bytes stack frame, 1128 bytes spill stores, 604 bytes spill loads
ptxas info : Used 63 registers, 96 bytes smem, 320 bytes cmem[0]
which is the only place in the output that smem is mentioned. There is one array in the global subroutine main_kernel with the shared attribute. If I remove the shared attribute then I get
ptxas info : Function properties for device_procedures_main_kernel_
432 bytes stack frame, 1124 bytes spill stores, 532 bytes spill loads
ptxas info : Used 63 registers, 320 bytes cmem[0]
The smem has disappeared. It seems that only shared memory in main_kernel is being counted: device subroutines in my code use variables with the shared attribute, but these don't appear to be mentioned in the output. For example, the device subroutine evalfuncs includes shared variable declarations, but the relevant output is
ptxas info : Function properties for device_procedures_evalfuncs_
504 bytes stack frame, 1140 bytes spill stores, 508 bytes spill loads
Do all variables with the shared attribute need to be declared in a global subroutine?

Do all variables with the shared attribute need to be declared in a global subroutine?
No.
You haven't shown any example code or your compile command, nor have you identified the version of the PGI compiler tools you are using. However, the most likely explanation I can think of for what you are seeing is that as of PGI 14.x, the default CUDA compile option is to generate relocatable device code. This is documented in section 2.2.3 of the current PGI release notes:
2.2.3. Relocatable Device Code
An rdc option is available for the -ta=tesla and -Mcuda flags that specifies to generate relocatable device code. Starting in PGI 14.1 on Linux and in PGI 14.2 on Windows, the default code generation and linking mode for Tesla-target OpenACC and CUDA Fortran is rdc, relocatable device code.
You can disable the default and enable the old behavior and non-relocatable code by specifying any of the following: -ta=tesla:nordc, -Mcuda=nordc, or by specifying any 1.x compute capability or any Radeon target.
So the specific option to enable (or disable) this is:
-Mcuda=(no)rdc
(note that -Mcuda=rdc is the default, if you don't specify this option)
CUDA Fortran separates Fortran host code from device code. For the device code, the CUDA Fortran compiler does a CUDA Fortran->CUDA C conversion, and passes the auto-generated CUDA C code to the CUDA C compiler. Therefore, the behavior and expectations of switches like rdc and ptxinfo are derived from the behavior of the underlying equivalent CUDA compiler options (-rdc=true and -Xptxas -v, respectively).
When CUDA device code is compiled without the rdc option, the compiler will normally try to inline device (sub)routines that are called from a kernel, into the main kernel code. Therefore, when the compiler is generating the ptxinfo, it can determine all resource requirements (e.g. shared memory, registers, etc.) when it is compiling (ptx assembly) the kernel code.
When the rdc option is specified, however, the compiler may (depending on some other switches and function attributes) leave the device subroutines as separately callable routines with their own entry point (i.e. not inlined). In that scenario, when the device compiler is compiling the kernel code, the call to the device subroutine just looks like a call instruction, and the compiler (at that point) has no visibility into the resource usage requirements of the device subroutine. This does not mean that there is an underlying flaw in the compile sequence. It simply means that the ptxinfo mechanism cannot accurately roll up the resource requirements of the kernel and all of its called subroutines at that point in time.
The ptxinfo output also does not declare the total amount of shared memory used by a device subroutine, when it is compiling that subroutine, in rdc mode.
If you turn off the rdc mode:
-Mcuda=nordc
I believe you will then see an accurate accounting of the shared memory used by a kernel plus all of its called subroutines, given a few caveats. One is that the compiler is able to successfully inline your called subroutines (quite likely, and the accounting should still work even if it can't). Another is that the kernel and all of its called subroutines are in the same file (i.e. translation unit). If you have kernels that call device subroutines in different translation units, then the rdc option is the only way to make it work.
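As a rough CUDA C analogue (hypothetical code, not your CUDA Fortran source; the names are just placeholders), the idea looks like this. Whether the 256 bytes of shared memory in the device routine show up on the kernel's smem line depends on whether the call actually gets inlined, as explained above:

// demo.cu -- hypothetical example
// compile: nvcc -c -Xptxas -v demo.cu             (whole-program mode, call typically inlined)
//     vs.: nvcc -c -rdc=true -Xptxas -v demo.cu   (relocatable device code)
__device__ void evalfuncs(float *out)
{
    __shared__ float scratch[64];          // 64 * 4 = 256 bytes of static shared memory
    scratch[threadIdx.x % 64] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = scratch[threadIdx.x % 64];
}

__global__ void main_kernel(float *out)
{
    evalfuncs(out);                        // the callee's smem is reflected in the
}                                          // kernel's ptxas line only if this call is inlined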
Shared memory will still be appropriately allocated for your code at runtime, regardless (assuming you have not violated the total amount of shared memory available). You can also get an accurate reading of the shared memory used by a kernel by profiling your code, using a profiler such as nvvp or nvprof.
If this explanation doesn't describe what you are seeing, I would suggest providing a complete sample code, as well as the exact compile command you are using, plus the version of PGI tools you are using. (I think it's a good suggestion for future questions as well.)

Related

What do the %envregN special registers hold?

I've read: CUDA PTX code %envreg<32> special registers. The poster there was satisfied with not trying to treat OpenCL-originating PTX as regular CUDA PTX. But their question about the %envregN registers was not properly answered.
Mark Harris wrote that
OpenCL supports larger grids than most NVIDIA GPUs support, so grid sizes need to be virtualized by dividing across multiple actual grid launches, and so offsets are necessary. Also in OpenCL, indices do not necessarily start from (0, 0, 0), the user can specify offsets which the driver must pass to the kernel. Therefore the registers initialized for OpenCL and CUDA C launches are different.
So, do the %envregN registers make up the "virtual grid index"? And what does each of these registers hold?
The extent of the answer that can be authoritatively given is what is in the PTX documentation:
A set of 32 pre-defined read-only registers used to capture execution environment of PTX program outside of PTX virtual machine. These registers are initialized by the driver prior to kernel launch and can contain cta-wide or grid-wide values.
Anything beyond that would have to be:
discovered via reverse engineering or disclosed by someone with authoritative/unpublished knowledge
subject to change (being undocumented)
evidently under control of the driver, which means that for a different driver (e.g. CUDA vs. OpenCL) the contents and/or interpretation might be different.
If you think that NVIDIA documentation should be improved in any way, my suggestion would be to file a bug.
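(As one hedged illustration of the reverse-engineering route mentioned above: the register contents can be dumped from a kernel with inline PTX, along the lines of the hypothetical probe below. For a plain CUDA launch there is no guarantee the values mean anything.)

// Hypothetical probe, not from the linked question: read two of the %envreg
// special registers via inline PTX and write them out for inspection.
__global__ void dump_envregs(unsigned int *out)
{
    unsigned int e0, e1;
    asm volatile("mov.u32 %0, %%envreg0;" : "=r"(e0));
    asm volatile("mov.u32 %0, %%envreg1;" : "=r"(e1));
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        out[0] = e0;                       // contents are driver-defined (see above)
        out[1] = e1;
    }
}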

Problems with maxrregcount and dynamic parallelism

I am trying to estimate the effect of restricting register usage on achieved occupancy of the application. While running my experiments, when I tried to restrict the number of registers of cdpBezierTessellation application found in Nvidia samples, I got an error.
Flag added to nvcc: -maxrregcount 16
Error: nvlink error : entry function '_Z21computeBezierLinesCDPP10BezierLinei' with max regcount of 16 calls function 'cudaMalloc' with regcount of 18
I don't understand exactly why this is happening. Can anyone help me with this?
As commenters have said, the linker error message is very clear in telling you what is happening. You are trying to compile your kernel (computeBezierLinesCDP()) telling it that it may use a maximum of 16 registers, but when you come to the link step (which happens after compilation) the linker finds that one of the functions you are calling from within the kernel (cudaMalloc()) uses 18 registers. This is a constraint the linker is clearly unable to satisfy!
Since you cannot reduce the number of registers used by cudaMalloc() (since it's a pre-compiled library routine), you need to increase your register limit.
If you really need to constrain your kernel to 16 registers then you would need to avoid calling cudaMalloc() (and any other routine that uses more registers). You may be able to avoid allocating memory from within your kernel by pre-allocating from the host.
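For example, a minimal sketch (made-up code, not the actual cdpBezierTessellation sample) of pre-allocating the buffer on the host so the parent kernel never calls cudaMalloc():

// cdp_sketch.cu -- hypothetical example
// compile with e.g.: nvcc -arch=sm_35 -rdc=true -lcudadevrt cdp_sketch.cu
__global__ void child(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = 2.0f * i;
}

__global__ void parent(float *scratch, int n)   // scratch was allocated by the host
{
    // no device-side cudaMalloc() here; note that the device-side launch below
    // still links in device-runtime routines, so a very low register cap may
    // remain unsatisfiable for the same reason given above
    if (threadIdx.x == 0)
        child<<<(n + 255) / 256, 256>>>(scratch, n);
}

int main()
{
    const int n = 1024;
    float *scratch = 0;
    cudaMalloc(&scratch, n * sizeof(float));    // allocation moved to the host
    parent<<<1, 32>>>(scratch, n);
    cudaDeviceSynchronize();
    cudaFree(scratch);
    return 0;
}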

How does the ABI defines the number of registers in GPU?

There is a line in CUDA Compiler Driver NVCC - Options for steering GPU code generation which is ambiguous to me:
Value less than the minimum registers required by ABI will be bumped up by the compiler to ABI minimum limit.
Does the ABI have any standard or limitations for the number of registers that __global__ and __device__ functions use?
I think (can't find a reference right now) the CUDA ABI requires at least 16 registers. So if you specify a lower register count (e.g. with -maxrregcount) the compiler will bump the specified limit up to the minimum required by the ABI, and will print an advisory message stating that it did so. As for the maximum number of 32-bit registers available per thread, it is GPU architecture dependent: 124 registers for sm_1x, 63 registers for sm_2x and sm_30, and 255 registers for sm_35.
Generally speaking, an ABI (application binary interface) is an architecture-specific convention for storage layout, passing of arguments to functions, passing of function results back to the caller, etc. ABIs (including those for x86_64 and ARM) often designate specific registers for specific tasks such as the stack pointer, function return value, and function arguments. Since the GPU architecture allows a variable number of registers per thread, use of the ABI requires a minimum number of registers to be present to fill these defined roles. If I recall correctly, CUDA introduced an ABI with version 3.0, which was the first version to support Fermi-class GPUs.
The ABI requires compute capability 2.0 or higher. Older GPU architectures lacked hardware features required for the ABI. Most of the newer CUDA features, such as device-side printf() and malloc(), called functions, separate compilation, etc., rely on and require the use of the ABI, and it is used by default in compiler-generated code for sm_20 and above. You can disable the use of the ABI with -Xptxas -abi=no. I would strongly advise against doing that.
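As a quick way to observe the bump described above (the kernel below is just a throwaway example), compile with a register cap below the ABI minimum and inspect what ptxas reports:

// regdemo.cu -- hypothetical example
// compile: nvcc -c -arch=sm_20 -maxrregcount 8 -Xptxas -v regdemo.cu
// With the cap below the ABI minimum, the compiler raises it to that minimum
// and prints an advisory message saying so (as described above).
__global__ void regdemo(float *out, const float *in)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i] * in[i] + 1.0f;
}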

CUDA constant memory banks

When we check the register usage by using -Xptxas -v we see something like this:
ptxas info : Used 63 registers, 244 bytes cmem[0], 51220 bytes cmem[2], 24 bytes cmem[14], 20 bytes cmem[16]
I wonder if there is currently any documentation that clearly explains cmem[x]. What is the point of separating constant memory into multiple banks, how many banks are there in total, and what are the banks other than 0, 2, 14, and 16 used for?
As a side note, @njuffa (special thanks to you) previously explained on NVIDIA's forum what banks 0, 2, 14, and 16 are:
Used constant memory is partitioned in constant program ‘variables’ (bank 1), plus compiler generated constants (bank 14).
cmem[0]: kernel arguments
cmem[2]: user defined constant objects
cmem[16]: compiler generated constants (some of which may correspond to literal constants in the source code)
The usage of GPU constant banks by CUDA is not officially documented to my knowledge. The number and usage of constant banks does differ between GPU generations. These are low-level implementation details that programmers do not have to worry about.
The usage of constant banks can be reverse engineered, if so desired, by looking at the machine code (SASS) generated for a given platform. In fact, this is how I came up with the information cited in the original question (this information came from an NVIDIA developer forum post of mine). As I recall, the information I gave there was based on ad-hoc reverse engineering specifically applied to Fermi-class devices, but I am unable to verify this at this time as the forums are inaccessible at the moment.
One reason for having multiple constant banks is to reserve the user visible constant memory for the use of CUDA programmers, while storing additional read-only information provided by hardware or tools in additional constant banks.
Note that the CUDA math library is provided as source files and the functions get inlined into user code, therefore constant memory usage of CUDA math library functions is included in the statistics for the user-visible constant memory.
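As a small experiment (made-up names; the exact bank numbers reported depend on the architecture you compile for), the kind of breakdown shown in the question can be reproduced with a kernel that has both parameters and a user-defined __constant__ object:

// cmem_demo.cu -- hypothetical example
// compile: nvcc -c -arch=sm_20 -Xptxas -v cmem_demo.cu
__constant__ float coeffs[1024];            // user-defined constant object

__global__ void cmem_demo(float *out, const float *in, float scale)
{
    // the kernel arguments (out, in, scale) also live in a constant bank
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = scale * coeffs[i & 1023] + in[i];
}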
Please see "Miscellaneous NVCC Usage".
They mention that the constant bank allocation is profile-specific.
In the PTX guide, they say that apart from the 64 KB of constant memory, there are 10 more banks for constant memory. The driver may allocate and initialize constant buffers in these regions and pass pointers to the buffers as kernel function parameters.
I guess that the profile given to nvcc will take care of which constants go into which bank. Anyway, we don't need to worry as long as each constant bank cmem[n] stays under 64 KB, because each bank is 64 KB in size and common to all threads in the grid.

CUDA shared memory address space vs. global memory

To avoid really long and incohesive functions I am calling a number of device functions from a kernel. I allocate a shared buffer at the beginning of the kernel call (which is per-thread-block) and pass pointers to it to all the device functions that are performing some processing steps in the kernel.
I was wondering about the following:
If I allocate a shared memory buffer in a global function, how can other device functions that I pass a pointer to distinguish between the possible address types (global device or shared memory) that the pointer could refer to?
Note that it is invalid to decorate the formal parameters with a shared modifier according to the 'CUDA programming guide'. The only ways it could be implemented, IMHO, are
a) by putting markers on the allocated memory,
b) by passing invisible parameters with the call, or
c) by having a virtual unified address space that has separate segments for global and shared memory, so that a threshold check on the pointer can be used.
So my question is: do I need to worry about this, or how should one proceed alternatively without inlining all functions into the main kernel?
===========================================================================================
On the side I was today horrified that NVCC with CUDA Toolkit 3.0 disallows so-called 'external calls from global functions', requiring them to be inlined. This means in effect I have to declare all device functions inline and the separation of header / source files is broken. This is of course quite ugly, but is there an alternative?
If I allocate a shared memory buffer in a global function, how can other device functions that I pass a pointer to distinguish between the possible address types (global device or shared memory) that the pointer could refer to?
Note that "shared" memory, in the context of CUDA, specifically means the on-chip memory that is shared between all threads in a block. So, if you mean an array declared with the __shared__ qualifier, it normally doesn't make sense to use it for passing information between device functions (as all the threads see the very same memory). I think the compiler might put regular arrays in shared memory? Or maybe it was in the register file. Anyway, there's a good chance that it ends up in global memory, which would be an inefficient way of passing information between the device functions (especially on < 2.0 devices).
On the side I was today horrified that NVCC with CUDA Toolkit 3.0 disallows so-called 'external calls from global functions', requiring them to be inlined. This means in effect I have to declare all device functions inline and the separation of header / source files is broken. This is of course quite ugly, but is there an alternative?
CUDA (as of Toolkit 3.0, which you mention) does not include a linker for device code, so you must keep the kernel(s) and all related device functions in the same .cu file.
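One common workaround from that era (a sketch with made-up file names) is to keep the device functions in a header that is included into the .cu file containing the kernel, so everything still lands in one translation unit:

// devfuncs.cuh -- included, never compiled on its own
#ifndef DEVFUNCS_CUH
#define DEVFUNCS_CUH
__device__ inline float square(float x) { return x * x; }
#endif

// kernels.cu
#include "devfuncs.cuh"
__global__ void k(float *out)
{
    out[threadIdx.x] = square((float)threadIdx.x);
}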
This depends on the compute capability of your CUDA device. For devices of compute capability <2.0, the compiler has to decide at compile time whether a pointer points to shared or global memory and issue separate instructions. This is not required for devices with compute capability >= 2.0.
By default, all function calls within a kernel are inlined and the compiler can then, in most cases, use flow analysis to see if something is shared or global. If you're compiling for a device of compute capability <2.0, you may have encountered the warning 'Cannot tell what pointer points to, assuming global memory space'. This is what you get when the compiler can't follow your pointers around correctly.
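For reference, a minimal sketch of the pattern under discussion (made-up names): a __shared__ buffer declared in the kernel, with a plain pointer handed to a device function that also works with a global pointer. On compute capability >= 2.0 generic addressing handles this; on older devices the compiler tries to deduce the address space as described above.

__device__ void process(float *buf, int n)   // buf may point to shared or global memory
{
    if (threadIdx.x < n)
        buf[threadIdx.x] *= 2.0f;
    __syncthreads();
}

__global__ void kernel(float *gdata, int n)  // assumes n <= blockDim.x <= 256
{
    __shared__ float tile[256];              // per-block shared buffer
    if (threadIdx.x < n)
        tile[threadIdx.x] = gdata[threadIdx.x];
    __syncthreads();

    process(tile, n);                        // pointer into shared memory
    process(gdata, n);                       // same function, pointer into global memory

    if (threadIdx.x < n)
        gdata[threadIdx.x] += tile[threadIdx.x];
}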