CUDA shared memory address space vs. global memory - function

To avoid really long and incohesive functions I am calling
a number of device functions from a kernel. I allocate a shared
buffer at the beginning of the kernel call (which is per-thread-block)
and pass pointers to it to all the device functions that are
performing some processing steps in the kernel.
I was wondering about the following:
If I allocate a shared memory buffer in a global function
how can other device functions that I pass a pointer distinguish
between the possible address types (global device or shared mem) that
the pointer could refer to.
Note it is invalid to decorate the formal parameters with a shared modifier
according to the 'CUDA programming guide'. The only way imhoit could be
implemented is
a) by putting markers on the allocated memory
b) passing invisible parameters with the call.
c) having a virtual unified address space that has separate segments for
global and shared memory and a threshold check on the pointer can be used?
So my question is: Do I need to worry about it or how should one proceed alternatively
without inlining all functions into the main kernel?
===========================================================================================
On the side I was today horrified that NVCC with CUDA Toolkit 3.0 disallows so-called
'external calls from global functions', requiring them to be inlined. This means in effect
I have to declare all device functions inline and the separation of header / source
files is broken. This is of course quite ugly, but is there an alternative?

If I allocate a shared memory buffer in a global function how can other device functions that I pass a pointer distinguish between the possible address types (global device or shared mem) that the pointer could refer to.
Note that "shared" memory, in the context of CUDA, specifically means the on-chip memory that is shared between all threads in a block. So, if you mean an array declared with the __shared__ qualifier, it normally doesn't make sense to use it for passing information between device functions (as all the threads see the very same memory). I think the compiler might put regular arrays in shared memory? Or maybe it was in the register file. Anyway, there's a good chance that it ends up in global memory, which would be an inefficient way of passing information between the device functions (especially on < 2.0 devices).
On the side I was today horrified that NVCC with CUDA Toolkit 3.0 disallows so-called 'external calls from global functions', requiring them to be inlined. This means in effect I have to declare all device functions inline and the separation of header / source files is broken. This is of course quite ugly, but is there an alternative?
CUDA does not include a linker for device code so you must keep the kernel(s) and all related device functions in the same .cu file.

This depends on the compute capability of your CUDA device. For devices of compute capability <2.0, the compiler has to decide at compile time whether a pointer points to shared or global memory and issue separate instructions. This is not required for devices with compute capability >= 2.0.
By default, all function calls within a kernel are inlined and the compiler can then, in most cases, use flow analysis to see if something is shared or global. If you're compiling for a device of compute capability <2.0, you may have encountered the warning warning : Cannot tell what pointer points to, assuming global memory space. This is what you get when the compiler can't follow your pointers around correctly.

Related

NVIDIA __constant memory: how to populate constant memory from host in both OpenCL and CUDA?

I have a buffer (array) on the host that should be resided in the constant memory region of the device (in this case, an NVIDIA GPU).
So, I have two questions:
How can I allocate a chunk of constant memory? Given the fact that I am tracing the available constant memory on the device and I know, for a fact, that we have that amount of memory available to us (at this time)
How can I initialize (populate) those arrays from values that are computed at the run time on the host?
I searched the web for this but there is no concise document documenting this. I would appreciate it if provided examples would be in both OpenCL and CUDA. The example for OpenCL is more important to me than CUDA.
How can I allocate a chunk of constant memory? Given the fact that I am tracing the available constant memory on the device and I know, for a fact, that we have that amount of memory available to us (at this time)
In CUDA, you can't. There is no runtime allocation of constant memory, only static definition of memory via the __constant__ specifier which get mapped to constant memory pages at assembly. You could generate some code contain such a static declaration at runtime and compile it via nvrtc, but that seems like a lot of effort for something you know can only be sized up to 64kb. It seems much simpler (to me at least) to just statically declare a 64kb constant buffer and use it at runtime as you see fit.
How can I initialize (populate) those arrays from values that are computed at the runtime on the host?
As noted in comments, see here. The cudaMemcpyToSymbol API was created for this purpose and it works just like standard memcpy.
Functionally, there is no difference between __constant in OpenCL and __constant__ in CUDA. The same limitations apply: static definition at compile time (which is runtime in the standard OpenCL execution model), 64kb limit.
For cuda, I use driver API and NVRTC and create kernel string with a global constant array like this:
auto kernel = R"(
..
__constant__ ##Type## buffer[##SIZE##]={
##elm##
};
..
__global__ void test(int * input)
{ }
)";
then replace ##-pattern words with size and element value information in run-time and compile like this:
__constant__ int buffer[16384]={ 1,2,3,4, ....., 16384 };
So, it is run-time for the host, compile-time for the device. Downside is that the kernel string gets too big, has less readability and connecting classes needs explicitly linking (as if you are compiling a side C++ project) other compilation units. But for simple calculations with only your own implementations (no host-definitions used directly), it is same as runtime API.
Since large strings require extra parsing time, you can cache the ptx intermediate data and also cache the binary generated from ptx. Then you can check if kernel string has changed and needs to be re-compiled.
Are you sure just __constant__ worths the effort? Do you have some benchmark results to show that actually improves performance? (premature optimization is source of all evil). Perhaps your algorithm works with register-tiling and the source of data does not matter?
Disclaimer: I cannot help you with CUDA.
For OpenCL, constant memory is effectively treated as read-only global memory from the programmer/API point of view, or defined inline in kernel source.
Define constant variables, arrays, etc. in your kernel code, like constant float DCT_C4 = 0.707106781f;. Note that you can dynamically generate kernel code on the host at runtime to generate derived constant data if you wish.
Pass constant memory from host to kernel via a buffer object, just as you would for global memory. Simply specify a pointer parameter in the constant memory region in your kernel function's prototype and set the buffer on the host side with clSetKernelArg(), for example:
kernel void mykernel(
constant float* fixed_parameters,
global const uint* dynamic_input_data,
global uint* restrict output_data)
{
cl_mem fixed_parameter_buffer = clCreateBuffer(
cl_context,
CL_MEM_READ_ONLY | CL_MEM_HOST_NO_ACCESS | CL_MEM_COPY_HOST_PTR,
sizeof(cl_float) * num_fixed_parameters, fixed_parameter_data,
NULL);
clSetKernelArg(mykernel, 0, sizeof(cl_mem), &fixed_parameter_buffer);
Make sure to take into account the value reported for CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE for the context being used! It usually doesn't help to use constant memory buffers for streaming input data, this is better stored in global buffers, even if they are marked read-only for the kernel. constant memory is most useful for data that are used by a large proportion of work-items. There is typically a fairly tight size limitation such as 64KiB on it - some implementations may "spill" to global memory if you try to exceed this, which will lose you any performance advantages you would gain from using constant memory.

cudamalloc vs __device__ variables [duplicate]

I have a CUDA (v5.5) application that will need to use global memory. Ideally I would prefer to use constant memory, but I have exhausted constant memory and the overflow will have to be placed in global memory. I also have some variables that will need to be written to occasionally (after some reduction operations on the GPU) and I am placing this in global memory.
For reading, I will be accessing the global memory in a simple way. My kernel is called inside a for loop, and on each call of the kernel, every thread will access the exact same global memory addresses with no offsets. For writing, after each kernel call a reduction is performed on the GPU, and I have to write the results to global memory before the next iteration of my loop. There are far more reads from than writes to global memory in my application however.
My question is whether there are any advantages to using global memory declared in global (variable) scope over using dynamically allocated global memory? The amount of global memory that I need will change depending on the application, so dynamic allocation would be preferable for that reason. I know the upper limit on my global memory use however and I am more concerned with performance, so it is also possible that I could declare memory statically using a large fixed allocation that I am sure not to overflow. With performance in mind, is there any reason to prefer one form of global memory allocation over the other? Do they exist in the same physical place on the GPU and are they cached the same way, or is the cost of reading different for the two forms?
Global memory can be allocated statically (using __device__), dynamically (using device malloc or new) and via the CUDA runtime (e.g. using cudaMalloc).
All of the above methods allocate physically the same type of memory, i.e. memory carved out of the on-board (but not on-chip) DRAM subsystem. This memory has the same access, coalescing, and caching rules regardless of how it is allocated (and therefore has the same general performance considerations).
Since dynamic allocations take some non-zero time, there may be performance improvement for your code by doing the allocations once, at the beginning of your program, either using the static (i.e. __device__ ) method, or via the runtime API (i.e. cudaMalloc, etc.) This avoids taking the time to dynamically allocate memory during performance-sensitive areas of your code.
Also note that the 3 methods I outline, while having similar C/C++ -like access methods from device code, have differing access methods from the host. Statically allocated memory is accessed using the runtime API functions like cudaMemcpyToSymbol and cudaMemcpyFromSymbol, runtime API allocated memory is accessed via ordinary cudaMalloc / cudaMemcpy type functions, and dynamically allocated global memory (device new and malloc) is not directly accessible from the host.
First of all you need to think of coalescing the memory access. You didn't mention about the GPU you are using. In the latest GPUs, the coal laced memory read will give same performance as that of constant memory. So always make your memory read and write in coal laced manner as possible as you can.
Another you can use texture memory (If the data size fits into it). This texture memory has some caching mechanism. This is previously used in case when global memory read was non-coalesced. But latest GPUs give almost same performance for texture and global memory.
I don't think the globally declared memory give more performance over dynamically allocated global memory, since the coalescing issue still exists. Also global memory declared in global (variable) scope is not possible in case of CUDA global memory. The variables that can declare globally (in the program) are constant memory variables and texture, which we doesn't required to pass to kernels as arguments.
for memory optimizations please see the memory optimization section in cuda c best practices guide http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#memory-optimizations

What are the lifetimes for CUDA constant memory?

I'm having trouble wrapping my head around the restrictions on CUDA constant memory.
Why can't we allocate __constant__ memory at runtime? Why do I need to compile in a fixed size variable with near-global scope?
When is constant memory actually loaded, or unloaded? I understand that cudaMemcpytoSymbol is used to load the particular array, but does each kernel use its own allocation of constant memory? Related, is there a cost to binding, and unbinding similar to the old cost of binding textures (aka, using textures added a cost to every kernel invocation)?
Where does constant memory reside on the chip?
I'm primarily interested in answers as they relate to Pascal and Volta.
It is probably easiest to answer these six questions in reverse order:
Where does constant memory reside on the chip?
It doesn't. Constant memory is stored in statically reserved physical memory off-chip and accessed via a per-SM cache. When the compiler can identify that a variable is stored in the logical constant memory space, it will emit specific PTX instructions which allow access to that static memory via the constant cache. Note also that there are specific reserved constant memory banks for storing kernel arguments on all currently supported architectures.
Is there a cost to binding, and unbinding similar to the old cost of binding textures (aka, using textures added a cost to every kernel invocation)?
No. But there also isn't "binding" or "unbinding" because reservations are performed statically. The only runtime costs are host to device memory transfers and the cost of loading the symbols into the context as part of context establishment.
I understand that cudaMemcpytoSymbol is used to load the particular array, but does each kernel use its own allocation of constant memory?
No. There is only one "allocation" for the entire GPU (although as noted above, there is specific constant memory banks for kernel arguments, so in some sense you could say that there is a per-kernel component of constant memory).
When is constant memory actually loaded, or unloaded?
It depends what you mean by "loaded" and "unloaded". Loading is really a two phase process -- firstly retrieve the symbol and load it into the context (if you use the runtime API this is done automagically) and secondly any user runtime operations to alter the contents of the constant memory via cudaMemcpytoSymbol.
Why do I need to compile in a fixed size variable with near-global scope?
As already noted, constant memory is basically a logical address space in the PTX memory hierarchy which is reflected by a finite size reserved area of the GPU DRAM map and which requires the compiler to emit specific instructions to access uniformly via a dedicated on chip cache or caches. Given its static, compiler analysis driven nature, it makes sense that its implementation in the language would also be primarily static.
Why can't we allocate __constant__ memory at runtime?
Primarily because NVIDIA have chosen not to expose it. But given all the constraints outlined above, I don't think it is an outrageously poor choice. Some of this might well be historic, as constant memory has been part of the CUDA design since the beginning. Almost all of the original features and functionality in the CUDA design map to hardware features which existed for the hardware's first purpose, which was the graphics APIs the GPUs were designed to support. So some of what you are asking about might well be tied to historical features or limitations of either OpenGL or Direct 3D, but I am not familiar enough with either to say for sure.

Utilizing Register memory in CUDA

I have some questions regarding cuda registers memory
1) Is there any way to free registers in cuda kernel? I have variables, 1D and 2D arrays in registers. (max array size 48)
2) If I use device functions, then what happens to the registers I used in device function after its execution? Will they be available for calling kernel execution or other device functions?
3) How nvcc optimizes register usage? Please share the points important w.r.t optimization of memory intensive kernel
PS: I have a complex algorithm to port to cuda which is taking a lot of registers for computation, I am trying to figure out whether to store intermediate data in register and write one kernel or store it in global memory and break algorithm in multiple kernels.
Only local variables are eligible of residing in registers (see also Declaring Variables in a CUDA kernel). You don't have direct control on which variables (scalar or static array) will reside in registers. The compiler will make it's own choices, striving for performance with respected to register sparing.
Register usage can be limited using the maxrregcount options of nvcc compiler.
You can also put most small 1D, 2D arrays in shared memory or, if accessing to constant data, put this content into constant memory (which is cached very close to register as L1 cache content).
Another way of reducing register usage when dealing with compute bound kernels in CUDA is to process data in stages, using multiple global kernel functions calls and storing intermediate results into global memory. Each kernel will use far less registers so that more active threads per SM will be able to hide load/store data movements. This technique, in combination with a proper usage of streams and asynchronous data transfers is very successful most of the time.
Regarding the use of device function, I'm not sure, but I guess registers's content of the calling function will be moved/stored into local memory (L1 cache or so), in the same way as register spilling occurs when using too many local variables (see CUDA Programming Guide -> Device Memory Accesses -> Local Memory). This operation will free up some registers for the called device function. After the device function is completed, their local variables exist no more, and registers can be now used again by the caller function and filled with the previously saved content.
Keep in mind that small device functions defined in the same source code of the global kernel could be inlined by the compiler for performance reasons: when this happen, the resulting kernel will in general require more registers.

Variable in the shared and managed memory in cuda

With the unified memory feature now in CUDA, variables can be in the managed memory and this makes the code a little simpler. Whereas the shared memory is shared between threads in a thread block. See section 3.2 in CUDA Fortran.
My question is can a variable be in both the shared and managed memory? They would be managed in the host but on the device, they would be sharedWhat type of behaviour maybe expected of this type of variable ?
I am using CUDA Fortran. I ask this question because, declaring the variable as managed makes it easier for me to code whereas making it shared in the device makes it faster than the global device memory.
I could not find anything that gave me a definitive answer in the documentation.
My question is can a variable be in both the shared and managed memory?
No, it's not possible.
Managed memory is created with either a static allocator (__managed__ in CUDA C, or managed attribute in CUDA fortran) or a dynamic allocator (cudaMallocManaged in CUDA C, or managed (allocatable) attribute in CUDA fortran).
Both of these are associated with the logical global memory space in CUDA. The __shared__ (or shared) memory space is a separate logical space, and must be used (allocated, accessed) explicitly, independent of any global space usage.