CUDA constant memory banks - cuda

When we check register usage by passing -Xptxas -v to nvcc, we see something like this:
ptxas info : Used 63 registers, 244 bytes cmem[0], 51220 bytes cmem[2], 24 bytes cmem[14], 20 bytes cmem[16]
I wonder if currently there is any documentation that clearly explains cmem[x]. What is the point of separating constant memory into multiple banks, how many banks are there in total, and what are other banks other than 0, 2, 14, 16 used for?
As a side note, @njuffa (special thanks to you) previously explained on NVIDIA's forum what banks 0, 2, 14 and 16 are:
Used constant memory is partitioned in constant program ‘variables’ (bank 1), plus compiler generated constants (bank 14).
cmem[0]: kernel arguments
cmem[2]: user defined constant objects
cmem[16]: compiler generated constants (some of which may correspond to literal constants in the source code)

The usage of GPU constant banks by CUDA is not officially documented to my knowledge. The number and usage of constant banks does differ between GPU generations. These are low-level implementation details that programmers do not have to worry about.
The usage of constant banks can be reverse engineered, if so desired, by looking at the machine code (SASS) generated for a given platform. In fact, this is how I came up with the information cited in the original question (this information came from an NVIDIA developer forum post of mine). As I recall, the information I gave there was based on ad hoc reverse engineering specifically applied to Fermi-class devices, but I am unable to verify this at this time as the forums are inaccessible at the moment.
One reason for having multiple constant banks is to reserve the user visible constant memory for the use of CUDA programmers, while storing additional read-only information provided by hardware or tools in additional constant banks.
Note that the CUDA math library is provided as source files and the functions get inlined into user code, therefore constant memory usage of CUDA math library functions is included in the statistics for the user-visible constant memory.

Please see "Miscellaneous NVCC Usage".
It mentions that the constant bank allocation is profile-specific.
The PTX guide says that, apart from the 64KB of constant memory, there are 10 more banks for constant memory. The driver may allocate and initialize constant buffers in these regions and pass pointers to the buffers as kernel function parameters.
I guess that the profile given to nvcc determines which constants go into which bank. Anyway, we don't need to worry as long as each cmem[n] is under 64KB, because each bank is 64KB in size and common to all threads in the grid.
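As a rough sketch of that check, the following Python snippet parses the ptxas line from the question and compares each bank against 64KB. The function name and the regular expressions are my own, and they assume the output format looks exactly like the line quoted in the question; the 64KB limit is the figure stated in this answer, not something the script discovers.

```python
import re

BANK_LIMIT_BYTES = 64 * 1024  # per-bank size as stated in the answer above


def parse_ptxas_usage(line):
    """Return (registers, {bank: bytes}) from a 'ptxas info : Used ...' line."""
    reg_match = re.search(r"Used (\d+) registers", line)
    registers = int(reg_match.group(1)) if reg_match else None
    banks = {
        int(bank): int(size)
        for size, bank in re.findall(r"(\d+) bytes cmem\[(\d+)\]", line)
    }
    return registers, banks


line = ("ptxas info : Used 63 registers, 244 bytes cmem[0], "
        "51220 bytes cmem[2], 24 bytes cmem[14], 20 bytes cmem[16]")
registers, banks = parse_ptxas_usage(line)

# Banks whose reported usage exceeds the assumed 64KB limit (none here).
over_limit = [bank for bank, size in banks.items() if size > BANK_LIMIT_BYTES]

print("registers:", registers)
for bank, size in sorted(banks.items()):
    print(f"cmem[{bank}]: {size} bytes")
print("banks over limit:", over_limit)
```

For the line in the question, every bank is well under the limit; cmem[2] at 51220 bytes is the closest.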

Related

What do the %envregN special registers hold?

I've read: CUDA PTX code %envreg<32> special registers . The poster there was satisfied with not trying to treat OpenCL-originating PTX as a regular CUDA PTX. But - their question about %envN registers was not properly answered.
Mark Harris wrote that
OpenCL supports larger grids than most NVIDIA GPUs support, so grid sizes need to be virtualized by dividing across multiple actual grid launches, and so offsets are necessary. Also in OpenCL, indices do not necessarily start from (0, 0, 0), the user can specify offsets which the driver must pass to the kernel. Therefore the registers initialized for OpenCL and CUDA C launches are different.
So, do the %envN registers make up the "virtual grid index"? And what does each of these registers hold?
The extent of the answer that can be authoritatively given is what is in the PTX documentation:
A set of 32 pre-defined read-only registers used to capture execution environment of PTX program outside of PTX virtual machine. These registers are initialized by the driver prior to kernel launch and can contain cta-wide or grid-wide values.
Anything beyond that would have to be:
discovered via reverse engineering or disclosed by someone with authoritative/unpublished knowledge
subject to change (being undocumented)
evidently under control of the driver, which means that for a different driver (e.g. CUDA vs. OpenCL) the contents and/or interpretation might be different.
If you think that NVIDIA documentation should be improved in any way, my suggestion would be to file a bug.

What are the lifetimes for CUDA constant memory?

I'm having trouble wrapping my head around the restrictions on CUDA constant memory.
Why can't we allocate __constant__ memory at runtime? Why do I need to compile in a fixed size variable with near-global scope?
When is constant memory actually loaded, or unloaded? I understand that cudaMemcpyToSymbol is used to load the particular array, but does each kernel use its own allocation of constant memory? Related, is there a cost to binding and unbinding, similar to the old cost of binding textures (aka, using textures added a cost to every kernel invocation)?
Where does constant memory reside on the chip?
I'm primarily interested in answers as they relate to Pascal and Volta.
It is probably easiest to answer these six questions in reverse order:
Where does constant memory reside on the chip?
It doesn't. Constant memory is stored in statically reserved physical memory off-chip and accessed via a per-SM cache. When the compiler can identify that a variable is stored in the logical constant memory space, it will emit specific PTX instructions which allow access to that static memory via the constant cache. Note also that there are specific reserved constant memory banks for storing kernel arguments on all currently supported architectures.
Is there a cost to binding, and unbinding similar to the old cost of binding textures (aka, using textures added a cost to every kernel invocation)?
No. But there also isn't "binding" or "unbinding" because reservations are performed statically. The only runtime costs are host to device memory transfers and the cost of loading the symbols into the context as part of context establishment.
I understand that cudaMemcpyToSymbol is used to load the particular array, but does each kernel use its own allocation of constant memory?
No. There is only one "allocation" for the entire GPU (although as noted above, there are specific constant memory banks for kernel arguments, so in some sense you could say that there is a per-kernel component of constant memory).
When is constant memory actually loaded, or unloaded?
It depends what you mean by "loaded" and "unloaded". Loading is really a two-phase process: first, the symbol is retrieved and loaded into the context (if you use the runtime API this is done automagically), and second, any user runtime operations alter the contents of the constant memory via cudaMemcpyToSymbol.
Why do I need to compile in a fixed size variable with near-global scope?
As already noted, constant memory is basically a logical address space in the PTX memory hierarchy which is reflected by a finite size reserved area of the GPU DRAM map and which requires the compiler to emit specific instructions to access uniformly via a dedicated on chip cache or caches. Given its static, compiler analysis driven nature, it makes sense that its implementation in the language would also be primarily static.
Why can't we allocate __constant__ memory at runtime?
Primarily because NVIDIA have chosen not to expose it. But given all the constraints outlined above, I don't think it is an outrageously poor choice. Some of this might well be historic, as constant memory has been part of the CUDA design since the beginning. Almost all of the original features and functionality in the CUDA design map to hardware features which existed for the hardware's first purpose, which was the graphics APIs the GPUs were designed to support. So some of what you are asking about might well be tied to historical features or limitations of either OpenGL or Direct 3D, but I am not familiar enough with either to say for sure.

Weak guarantees for non-atomic writes on GPUs?

OpenCL and CUDA have included atomic operations for several years now (although obviously not every CUDA or OpenCL device supports these). But - my question is about the possibility of "living with" races due to non-atomic writes.
Suppose several threads in a grid all write to the same location in global memory. Are we guaranteed that, when kernel execution has concluded, the results of one of these writes will be present in that location, rather than some junk?
Relevant parameters for this question (choose any combination(s); edit: except nVIDIA+CUDA, which already got an answer):
Memory space: Global memory only; this question is not about local/shared/private memory.
Alignment: Within a single memory write width (e.g. 128 bits on nVIDIA GPUs)
GPU Manufacturer: AMD / nVIDIA
Programming framework: CUDA / OpenCL
Position of store instruction in code: Same line of code for all threads / different lines of code.
Write destination: Fixed address / fixed offset from the address of a function parameter / completely dynamic
Write width: 8 / 32 / 64 bits.
Are we guaranteed that, when kernel execution has concluded, the results of one of these writes will be present in that location, rather than some junk?
For current CUDA GPUs, and I'm pretty sure for NVIDIA GPUs with OpenCL, the answer is yes. Most of my terminology below will have CUDA in view. If you require an exhaustive answer for both CUDA and OpenCL, let me know, and I'll delete this answer. Very similar questions to this one have been asked, and answered, before anyway. Here's another, and I'm sure there are others.
When multiple "simultaneous" writes occur to the same location, one of them will win, intact.
Which one will win is undefined. The behavior of the non-winning writes is also undefined (they may occur, but be replaced by the winner, or they may not occur at all.) The actual contents of the memory location may transit through various values (such as the original value, plus any of the valid written values), but the transit will not pass through "junk" values (i.e. values that were not already there and were not written by any thread.) The transit ends up at the "winner", eventually.
Example 1:
Location X contains zero. Threads 1, 5, 32, 30000, and 450000 all write one to that location. If there is no other write traffic to that location, that location will eventually contain the value of one (at kernel termination, or earlier).
Example 2:
Location X contains 5. Thread 32 writes 1 to X. Thread 90303 writes 7 to X. Thread 432322 writes 972 to X. If there is no other write traffic to that location, upon kernel termination, or earlier, location X will contain either 1, 7 or 972. It will not contain any other value, including 5.
I'm assuming X is in global memory, all traffic to it is naturally aligned to it, and all traffic to it is of the same size, although these principles apply to shared memory as well. I'm also assuming you have not violated CUDA programming principles, such as the requirement for naturally aligned traffic to device memory locations. The transactions I have in view here are those that originate from a single SASS instruction (per thread). Such transactions can have a width of 1, 2, 4, or 8 bytes. The claims I've made here apply whether the writes originate from "the same line of code" or "different lines".
These claims describe hardware behavior, and therefore the "correctness" is ensured by the GPU hardware, not by the compiler, the CUDA programming model, or the C++ standard that CUDA is based on.
This is a fairly complex topic (especially when we factor in cache behavior, and what to expect when we throw reads in the mix), but "junk" values should never occur. The only values that should occur in global memory are those values that were there to begin with, or those values that were written by some thread, somewhere.
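To make the two examples above concrete, here is a tiny Python model of the claim. It is only a restatement of the rule in code, not a simulation of GPU hardware, and the function names are made up for illustration.

```python
def allowed_final_values(initial, written):
    """Values location X may hold once the kernel has finished.

    If nothing was written, the original value survives; otherwise exactly
    one of the written values wins, intact.
    """
    return set(written) if written else {initial}


def allowed_transient_values(initial, written):
    """Values a reader might observe while the writes are still in flight."""
    return {initial} | set(written)


# Example 1: X starts at 0, five threads all write 1.
print(allowed_final_values(0, [1, 1, 1, 1, 1]))

# Example 2: X starts at 5, three threads write 1, 7 and 972.
print(sorted(allowed_final_values(5, [1, 7, 972])))
print(sorted(allowed_transient_values(5, [1, 7, 972])))
```

Note that in Example 2 the original value 5 is a legal transient value but not a legal final value, and no value outside these sets ("junk") is ever allowed.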

total number of registers

I wanted to ask. It is said that using --ptxas-options=-v doesn't give the exact number of registers that a program uses.
1) Then, how am I going to supply the occupancy calculator with registers per thread and shared memory per block?
2) In my program I also use Thrust calls, which generate PTX code. I have 2 kernels, but I can see that the Thrust functions produce PTX as well. So, should I take these numbers into account too when counting the total number of registers I use? (I think yes!)
(the same applies for the shared memory)
1) Then, how am I going to supply the occupancy calculator with registers per thread and shared memory per block?
The only other thing needed should be rounding up (if necessary) the output of ptxas to an even granularity of register allocation, which varies by device (see Greg's answer here). I think the common register allocation granularities are 4 and 8, but I don't have a table of register allocation granularity by compute capability.
I think shared memory also has an allocation granularity. Since the max number of threadblocks per SM is limited anyway, this should only matter (for occupancy) if your allocation/usage is within a granular amount of exceeding the limit for however many blocks you are otherwise limited to.
I think in most cases you'll get a pretty good feel by using the numbers from ptxas without rounding. If you feel you need this level of accuracy in the occupancy calculator, asking a nice directed question like "what are the allocation granularities for registers and shared memory for various GPUs" may get someone like Greg to give you a crisp answer.
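A minimal sketch of that rounding step in Python, under stated assumptions: the granularity of 8, the 65536-register file and the 256-thread block are hypothetical inputs chosen for illustration, not values for any particular GPU, and real hardware allocates registers per warp, which this simplified model ignores.

```python
def round_up(value, granularity):
    """Round value up to the next multiple of granularity."""
    return ((value + granularity - 1) // granularity) * granularity


def blocks_limited_by_registers(regs_per_thread, threads_per_block,
                                regs_per_sm, granularity):
    """Simplified estimate of how many blocks fit in the register file."""
    allocated = round_up(regs_per_thread, granularity)
    return regs_per_sm // (allocated * threads_per_block)


# Hypothetical device parameters (assumptions, not from any spec sheet).
REGS_PER_SM = 65536
GRANULARITY = 8
THREADS_PER_BLOCK = 256

for reported in (47, 48, 63):
    rounded = round_up(reported, GRANULARITY)
    blocks = blocks_limited_by_registers(
        reported, THREADS_PER_BLOCK, REGS_PER_SM, GRANULARITY)
    print(f"ptxas reports {reported}, allocated {rounded}, blocks per SM {blocks}")
```

With these inputs, 47 and 48 registers round to the same allocation, which is why the unrounded ptxas figure is usually close enough.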
2) In my program I also use Thrust calls, which generate PTX code. I have 2 kernels, but I can see that the Thrust functions produce PTX as well. So, should I take these numbers into account too when counting the total number of registers I use? (I think yes!) (the same applies for the shared memory)
Fundamentally I believe this thinking is incorrect. The only place I could see where it might matter is if you are running concurrent kernels, and I doubt that is the case since you mention thrust. The only figures that matter for occupancy are the metrics for a single kernel launch. You do not add threads, or registers, or shared memory across different kernels, to calculate resource usage. When a kernel completes execution, it releases its resource usage, at least for these resource types (registers, shared memory, threads).

maxreg = 51 --> uses 48 registers. If maxreg = 0 --> uses 47 registers. Also "limited" version is faster

I'm curious why, if I put an upper limit on register usage (51 in my example), the compiler can produce a kernel that uses more registers than if I leave the limit unbounded.
Also, the version with more registers seems faster (10us over 700).
Which phases of the optimization stages change?
I cannot provide much insight into the actual CUDA compiler and its stages, but I can offer some common sense reasoning based on CUDA's execution architecture.
When not setting a maximum register number the compiler doesn't know what your target register number is and has to assume that you need to use as few registers as possible or employs some other heuristic. In general minimizing register usage per-thread means there are enough registers for more threads on a single core and thus maximizes utilization because more thread blocks can be resident on a single core, which is good.
But when you give a maximum register usage, the compiler knows that this is your maximum and assumes that up to that maximum it can use as many registers as possible. The reason for this is that the points where register occupation is too high and there are not enough registers for yet another thread block are actually hard limits. When there are not enough registers for yet another block once a single thread uses 65 registers, then it just doesn't matter if it uses 63 or 64 registers, as long as it doesn't use 65. So the compiler tries to use as many registers as possible (up to the maximum, of course), which is desirable, because registers are the fastest memory type you can get. But this reasoning can only be applied when the compiler knows this hard limit (i.e. you tell it), otherwise it has to employ some heuristics, which might not always be optimal.
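The hard-limit argument can be illustrated with a little arithmetic. The sketch below is in Python, and its numbers are assumptions picked so that the step falls between 64 and 65 registers as in the example above: a hypothetical register file of 32768 registers per multiprocessor and blocks of 256 threads. It ignores allocation granularity and every other occupancy limit.

```python
def resident_blocks(regs_per_thread, threads_per_block=256, regs_per_sm=32768):
    """Blocks that fit in the register file (registers are the only limit here)."""
    return regs_per_sm // (regs_per_thread * threads_per_block)


for regs in (62, 63, 64, 65, 66):
    print(f"{regs} registers per thread -> {resident_blocks(regs)} resident blocks")
```

Under these assumptions 62, 63 and 64 registers all allow two resident blocks, while 65 drops to one, so any register count up to the step costs nothing in occupancy.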
And the reason for why the version with 48 registers is faster than the one with 47 is likely because it, well, uses more registers. If not enough registers are available data has to be swapped out into local memory or copied repeatedly into temporary registers from other registers.
In the end this all makes perfect sense, because the more information you give the compiler (by setting your optimal register maximum), the better it can optimize and the more efficient the resulting code should be. And especially with GPU computing it is usually desirable to tune your kernels to the actual hardware and its resources as best as possible.