CUDA optimization for Maxwell - cuda

I am trying to understand the Parallel Forall post on instruction-level profiling, and especially the following lines in the section "Reducing Memory Dependency Stalls":
NVIDIA GPUs do not have indexed register files, so if a stack array is accessed with dynamic indices, the compiler must allocate the array in local memory. In the Maxwell architecture, local memory stores are not cached in L1 and hence the latency of local memory loads after stores is significant.
I understand what register files are, but what does it mean that they are not indexed? And why does it prevent the compiler from keeping a stack array accessed with dynamic indices in registers?
The quote says that the array will be stored in local memory. What block does this local memory correspond to in the architecture below?

... what does it mean that they are not indexed
It means that indirect addressing of registers is not supported. So it isn't possible to index from one register (theoretically the register holding the first element of an array) to another arbitrary register. As a result, the compiler can't generate code for non-static indexing of an array stored in registers.
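For illustration, here is a minimal sketch (the kernel names are hypothetical) contrasting the two cases. In practice the compiler can usually keep the first array entirely in registers, while the second is placed in local memory; checking with cuobjdump --dump-sass should show LDL/STL local memory instructions only for the dynamic version.

__global__ void static_index(float *out)
{
    float a[4] = {0.f, 1.f, 2.f, 3.f};
    // All indices are compile-time constants, so the compiler can map
    // a[0]..a[3] onto four individual registers.
    out[threadIdx.x] = a[0] + a[1] + a[2] + a[3];
}

__global__ void dynamic_index(const int *idx, float *out)
{
    float a[4] = {0.f, 1.f, 2.f, 3.f};
    // The index is only known at run time. Registers cannot be addressed
    // indirectly, so the compiler typically places the array in local memory.
    out[threadIdx.x] = a[idx[threadIdx.x] & 3];
}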
What block does this local memory correspond to in the architecture below?
It doesn't correspond to any of them. Local memory resides in off-chip DRAM (the same physical memory that backs global memory), not in any of the on-chip blocks.

Related

Are load and store operations in shared memory atomic?

I'm trying to figure out whether load and store operations on primitive types are atomic when we load/store from shared memory in CUDA.
On the one hand, it seems that any load/store is compiled to the PTX instruction ld.weak.shared.cta, which does not enforce atomicity. But on the other hand, the manual says that loads are serialized (9.2.3.1):
However, if multiple addresses of a memory request map to the same memory bank, the accesses are serialized
which hints at load/store atomicity "by default" in shared memory. Thus, would the instructions ld.weak.shared.cta and ld.relaxed.shared.cta have the same effect?
Or is that information the compiler needs anyway to avoid optimizing away loads and stores?
More generally, supposing variables are properly aligned, would __shared__ int and __shared__ cuda::atomic<int, cuda::thread_scope_block> provide the same guarantees (when considering only load and store operations)?
Bonus (relevant) question: with a primitive data type properly aligned and stored in global memory, accessed by threads from a single block, are __device__ int and __device__ cuda::atomic<int, cuda::thread_scope_block> equivalent in terms of atomicity of load/store operations?
Thanks for any insight.
Serialization does not imply atomicity: thread A writes the first two bytes of an integer, then thread B reads the variable, and finally thread A writes the last two bytes. All of this happens in sequence (not in parallel), but it still isn't atomic.
Further, serialization is not guaranteed in all cases, see:
Devices of compute capability 2.0 and higher have the additional ability to multicast shared memory accesses, meaning that multiple accesses to the same location by any number of threads within a warp are served simultaneously.
Conclusion: use atomic.
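As an illustration of that conclusion, here is a minimal sketch (the kernel is hypothetical, and it assumes a toolkit recent enough to ship libcu++'s <cuda/atomic>) of a block-scoped atomic living in shared memory:

#include <cuda/atomic>

__global__ void block_counter(int *out)
{
    // Block-scoped atomic placed in shared memory.
    __shared__ cuda::atomic<int, cuda::thread_scope_block> counter;

    if (threadIdx.x == 0)
        counter.store(0, cuda::memory_order_relaxed);
    __syncthreads();

    // Atomic read-modify-write; plain loads/stores on a __shared__ int
    // would carry no such guarantee.
    counter.fetch_add(1, cuda::memory_order_relaxed);
    __syncthreads();

    if (threadIdx.x == 0)
        out[blockIdx.x] = counter.load(cuda::memory_order_relaxed);
}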

What are the lifetimes for CUDA constant memory?

I'm having trouble wrapping my head around the restrictions on CUDA constant memory.
Why can't we allocate __constant__ memory at runtime? Why do I need to compile in a fixed size variable with near-global scope?
When is constant memory actually loaded, or unloaded? I understand that cudaMemcpyToSymbol is used to load a particular array, but does each kernel use its own allocation of constant memory? Relatedly, is there a cost to binding and unbinding, similar to the old cost of binding textures (i.e., using textures added a cost to every kernel invocation)?
Where does constant memory reside on the chip?
I'm primarily interested in answers as they relate to Pascal and Volta.
It is probably easiest to answer these six questions in reverse order:
Where does constant memory reside on the chip?
It doesn't. Constant memory is stored in statically reserved physical memory off-chip and accessed via a per-SM cache. When the compiler can identify that a variable is stored in the logical constant memory space, it will emit specific PTX instructions which allow access to that static memory via the constant cache. Note also that there are specific reserved constant memory banks for storing kernel arguments on all currently supported architectures.
Is there a cost to binding and unbinding, similar to the old cost of binding textures (i.e., using textures added a cost to every kernel invocation)?
No. But there also isn't "binding" or "unbinding" because reservations are performed statically. The only runtime costs are host to device memory transfers and the cost of loading the symbols into the context as part of context establishment.
I understand that cudaMemcpyToSymbol is used to load a particular array, but does each kernel use its own allocation of constant memory?
No. There is only one "allocation" for the entire GPU (although, as noted above, there are specific constant memory banks for kernel arguments, so in some sense you could say that there is a per-kernel component of constant memory).
When is constant memory actually loaded, or unloaded?
It depends what you mean by "loaded" and "unloaded". Loading is really a two-phase process -- firstly, retrieving the symbol and loading it into the context (if you use the runtime API this is done automagically), and secondly, any user runtime operations to alter the contents of the constant memory via cudaMemcpyToSymbol.
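To make the two phases concrete, here is a minimal sketch (the names are hypothetical) of the usual workflow: the reservation itself is static, and cudaMemcpyToSymbol only overwrites the contents of that reserved area at run time.

#include <cuda_runtime.h>

__constant__ float coeffs[16];              // statically reserved in constant memory

__global__ void apply(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * coeffs[i % 16];    // read through the constant cache
}

void upload_coeffs(const float *host_coeffs)
{
    // Runtime phase of "loading": overwrite the contents of the reserved area.
    cudaMemcpyToSymbol(coeffs, host_coeffs, 16 * sizeof(float));
}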
Why do I need to compile in a fixed size variable with near-global scope?
As already noted, constant memory is basically a logical address space in the PTX memory hierarchy, which is reflected by a finite-size reserved area of the GPU DRAM map, and which requires the compiler to emit specific instructions to access it uniformly via a dedicated on-chip cache or caches. Given its static, compiler-analysis-driven nature, it makes sense that its implementation in the language would also be primarily static.
Why can't we allocate __constant__ memory at runtime?
Primarily because NVIDIA have chosen not to expose it. But given all the constraints outlined above, I don't think it is an outrageously poor choice. Some of this might well be historic, as constant memory has been part of the CUDA design since the beginning. Almost all of the original features and functionality in the CUDA design map to hardware features which existed for the hardware's first purpose, which was the graphics APIs the GPUs were designed to support. So some of what you are asking about might well be tied to historical features or limitations of either OpenGL or Direct 3D, but I am not familiar enough with either to say for sure.

Utilizing Register memory in CUDA

I have some questions regarding CUDA register memory.
1) Is there any way to free registers in a CUDA kernel? I have variables and 1D and 2D arrays in registers (maximum array size 48).
2) If I use device functions, then what happens to the registers used in a device function after its execution? Will they be available for the calling kernel's execution or for other device functions?
3) How does nvcc optimize register usage? Please share the points that are important with respect to optimizing a memory-intensive kernel.
PS: I have a complex algorithm to port to CUDA which is taking a lot of registers for computation. I am trying to figure out whether to store intermediate data in registers and write one kernel, or to store it in global memory and break the algorithm into multiple kernels.
Only local variables are eligible to reside in registers (see also Declaring Variables in a CUDA kernel). You don't have direct control over which variables (scalar or static array) will reside in registers. The compiler makes its own choices, striving for performance with respect to register sparing.
Register usage can be limited using the maxrregcount option of the nvcc compiler.
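As a sketch of what that looks like in practice (the kernel below is hypothetical), a command such as

    nvcc --maxrregcount=32 kernel.cu

caps each thread at 32 registers; any live values that do not fit are spilled to local memory, which can raise occupancy at the cost of extra memory traffic.

__global__ void heavy_kernel(float *data)
{
    float acc[8];                           // many live values per thread
    #pragma unroll
    for (int k = 0; k < 8; ++k)
        acc[k] = data[threadIdx.x] * k;
    float sum = 0.0f;
    #pragma unroll
    for (int k = 0; k < 8; ++k)
        sum += acc[k];
    data[threadIdx.x] = sum;
}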
You can also put most small 1D and 2D arrays in shared memory or, if you are accessing constant data, put that content into constant memory (which is cached on-chip, very close to the registers, much like L1 cache content).
Another way of reducing register usage when dealing with compute-bound kernels in CUDA is to process data in stages, using multiple global kernel function calls and storing intermediate results in global memory. Each kernel will use far fewer registers, so more active threads per SM will be able to hide load/store data movements. This technique, in combination with a proper usage of streams and asynchronous data transfers, is very successful most of the time.
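A minimal sketch of that staging approach (all names are hypothetical): one register-heavy computation split into two lighter kernels, with the intermediate results parked in global memory between launches.

__global__ void stage1(const float *in, float *intermediate, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        intermediate[i] = in[i] * in[i];    // first part of the computation
}

__global__ void stage2(const float *intermediate, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = intermediate[i] + 1.0f;    // second part, fewer live registers
}

void run_staged(const float *d_in, float *d_tmp, float *d_out, int n, cudaStream_t s)
{
    int block = 256, grid = (n + block - 1) / block;
    stage1<<<grid, block, 0, s>>>(d_in, d_tmp, n);   // intermediate data never
    stage2<<<grid, block, 0, s>>>(d_tmp, d_out, n);  // leaves the device
}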
Regarding the use of device functions, I'm not sure, but I guess the registers' contents of the calling function will be moved/stored into local memory (L1 cache or so), in the same way register spilling occurs when using too many local variables (see CUDA Programming Guide -> Device Memory Accesses -> Local Memory). This operation frees up some registers for the called device function. After the device function completes, its local variables no longer exist, and the registers can be used again by the caller function and filled with the previously saved content.
Keep in mind that small device functions defined in the same source code as the global kernel can be inlined by the compiler for performance reasons: when this happens, the resulting kernel will in general require more registers.
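As a small sketch of that last point (the helper and kernel are hypothetical; __noinline__ and __forceinline__ are the real CUDA qualifiers for overriding the compiler's default inlining choice):

__device__ __noinline__ float helper(float x)   // keep this as a real call
{
    return x * x + 1.0f;
}

__global__ void caller(float *data)
{
    // With __noinline__ the caller's live registers may be saved around the
    // call; without it the helper's body is merged into the kernel, which
    // usually costs additional registers.
    data[threadIdx.x] = helper(data[threadIdx.x]);
}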

CUDA thread local array

I am writing a CUDA kernel that requires maintaining a small associative array per thread. By small, I mean 8 elements max worst case, and an expected number of entries of two or so; so nothing fancy; just an array of keys and an array of values, and indexing and insertion happens by means of a loop over said arrays.
Now I do this by means of thread-local memory; that is, identifier[size], where size is a compile-time constant. Now I've heard that under some circumstances this memory is stored off-chip, and under some circumstances it is stored on-chip. Obviously I want the latter, under all circumstances. I understand that I can accomplish this with a block of shared memory, where I let each thread work on its own private block; but really? I don't want to share anything between threads, and it would be a horrible kludge.
What exactly are the rules for where this memory goes? I can't seem to find any word from NVIDIA. For the record, I am using CUDA 5 and targeting Kepler.
Local variables are stored either in registers or in off-chip memory (which is cached on devices of compute capability >= 2.0).
Registers are only used for arrays if all array indices are constant and can be determined at compile time, as the architecture has no means for indexed access to registers at runtime.
In your case the number of keys may be small enough to use registers (and tolerate the increase in register pressure). Unroll all loops over array accesses to allow the compiler to place the keys in registers, and use cuobjdump -sass to check that it actually does.
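A minimal sketch of that approach (the kernel and its 8-entry arrays are hypothetical): fixed trip counts plus #pragma unroll leave only compile-time indices, so the keys and values can live in registers; cuobjdump -sass on the resulting binary should show no LDL/STL local memory instructions.

#define MAX_ENTRIES 8

__global__ void lookup(const int *queries, int *results)
{
    int keys[MAX_ENTRIES];
    int values[MAX_ENTRIES];

    #pragma unroll
    for (int i = 0; i < MAX_ENTRIES; ++i) { // fully unrolled: indices are constants
        keys[i] = -1;
        values[i] = 0;
    }

    int q = queries[threadIdx.x];
    int found = 0;

    #pragma unroll
    for (int i = 0; i < MAX_ENTRIES; ++i)   // lookup loop, also fully unrolled
        if (keys[i] == q)
            found = values[i];

    results[threadIdx.x] = found;
}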
If you don't want to spend registers, you can either choose shared memory with a per-thread offset (but check that the additional registers used to hold per-thread indices into shared memory don't outweigh the registers you save on the keys), as you mentioned, or do nothing and use off-chip "local" memory (really "global" memory with just a different addressing scheme), hoping for the cache to do its work.
If you hope for the cache to hold the keys and values, and don't use much shared memory, it may be beneficial to select the 48kB cache / 16kB shared memory setting over the default 16kB/48kB split using cudaDeviceSetCacheConfig().
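For completeness, a short host-side sketch of that call (the wrapper function is hypothetical; cudaDeviceSetCacheConfig and cudaFuncCachePreferL1 are the real runtime API names):

#include <cuda_runtime.h>

void prefer_l1_cache()
{
    // Request the larger L1 / smaller shared memory split for this device.
    cudaError_t err = cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);
    if (err != cudaSuccess) {
        // Treat this as a hint: some devices ignore the requested configuration.
    }
}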

Is local memory access coalesced?

Suppose, I declare a local variable in a CUDA kernel function for each thread:
float f = ...; // some calculations here
Suppose also that the declared variable was placed by the compiler in local memory (which is the same as global memory, except that it is visible to one thread only, as far as I know). My question is: will the accesses to f be coalesced when reading it?
I don't believe there is official documentation of how local memory (or the stack on Fermi) is laid out in memory, but I am pretty certain that per-multiprocessor allocations are accessed in a "striped" fashion, so that non-diverging threads in the same warp will get coalesced access to local memory. On Fermi, local memory is also cached using the same L1/L2 access mechanism as global memory.
CUDA cards don't set aside dedicated on-chip memory for local variables; local variables are kept in registers wherever possible (and spill to off-chip local memory otherwise). Complex kernels with lots of variables reduce the number of threads that can run concurrently, a condition known as low occupancy.