When passing parameters by value to a kernel function, where are the parameters copied?

I'm a beginner at CUDA programming and have a question.
When I pass parameters by value, like this:
__global__ void add(int a, int b, int *c) {
// some operations
}
Since variables a and b are passed to the kernel function add as copied values, I guessed some memory space would be needed to copy them into.
If I'm right, is the additional memory space where those parameters are copied
on the GPU or in the host's main memory?
The reason I ask is that I need to pass a big struct to a kernel function.
I also thought about passing a pointer to the struct, but that way seems to require calling cudaMalloc for the struct and each of its member variables.

The very short answer is that all arguments to CUDA kernels are passed by value, and those arguments are copied by the host via an API into a dedicated argument buffer on the GPU. At present, this buffer is stored in constant memory and there is a limit of 4KB of arguments per kernel launch -- see here.
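In practical terms, this means the big struct from the question can simply be passed by value as a kernel argument, as long as the total size of all arguments stays under that limit; only pointer members need their own cudaMalloc. A rough sketch (the struct layout and sizes here are invented for illustration):

struct Params {
    float alpha;
    float beta;
    int   n;
    int*  data;    // a device pointer member still needs its own allocation
};

__global__ void compute(Params p)           // p is copied into the argument buffer
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < p.n) p.data[i] = (int)(p.alpha * i + p.beta);
}

int main()
{
    Params p{2.0f, 1.0f, 1024, nullptr};
    cudaMalloc(&p.data, p.n * sizeof(int)); // only the pointed-to memory needs cudaMalloc
    compute<<<(p.n + 255) / 256, 256>>>(p); // the struct itself is passed by value
    cudaDeviceSynchronize();
    cudaFree(p.data);
    return 0;
}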
In more detail, the PTX standard (technically since compute capability 2.0 hardware and the CUDA ABI appeared) defines a dedicated logical state space called .param which holds kernel and device parameter arguments. See here. Quoting from that documentation:
Each kernel function definition includes an optional list of
parameters. These parameters are addressable, read-only variables
declared in the .param state space. Values passed from the host to the
kernel are accessed through these parameter variables using ld.param
instructions. The kernel parameter variables are shared across all
CTAs within a grid.
It further notes that:
Note: The location of parameter space is implementation specific. For example, in some implementations kernel parameters reside in
global memory. No access protection is provided between parameter and
global space in this case. Similarly, function parameters are mapped
to parameter passing registers and/or stack locations based on the
function calling conventions of the Application Binary Interface
(ABI).
So the precise location of the parameter state space is implementation specific. In the first iteration of CUDA hardware, it actually mapped to shared memory for kernel arguments and registers for device function arguments. However, since compute 2.0 hardware and the PTX 2.2 standard, it maps to constant memory for kernels under most circumstances. The documentation says the following on the matter:
The constant (.const) state space is a read-only memory initialized
by the host. Constant memory is accessed with a ld.const
instruction. Constant memory is restricted in size, currently limited
to 64 KB which can be used to hold statically-sized constant
variables. There is an additional 640 KB of constant memory,
organized as ten independent 64 KB regions. The driver may allocate
and initialize constant buffers in these regions and pass pointers to
the buffers as kernel function parameters. Since the ten regions are
not contiguous, the driver must ensure that constant buffers are
allocated so that each buffer fits entirely within a 64 KB region and
does not span a region boundary.
Statically-sized constant variables have an optional variable initializer; constant variables with no explicit initializer are
initialized to zero by default. Constant buffers allocated by the
driver are initialized by the host, and pointers to such buffers are
passed to the kernel as parameters.
[Emphasis mine]
So while kernel arguments are stored in constant memory, this is not the same constant memory which maps to the .const state space accessible by defining a variable as __constant__ in CUDA C or the equivalent in Fortran or Python. Rather, it is an internal pool of device memory managed by the driver and not directly accessible to the programmer.
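To make the .param machinery concrete, compiling the add kernel from the question with nvcc -ptx shows the arguments declared in the parameter state space and read with ld.param instructions. The PTX in the comments below is abridged and illustrative; the mangled names and register numbers will vary:

__global__ void add(int a, int b, int *c) {
    *c = a + b;
}

// Roughly what nvcc -ptx emits for this kernel (abridged):
//
// .visible .entry _Z3addiiPi(
//     .param .u32 _Z3addiiPi_param_0,   // a
//     .param .u32 _Z3addiiPi_param_1,   // b
//     .param .u64 _Z3addiiPi_param_2    // c
// )
// {
//     ld.param.u32  %r1,  [_Z3addiiPi_param_0];
//     ld.param.u32  %r2,  [_Z3addiiPi_param_1];
//     ld.param.u64  %rd1, [_Z3addiiPi_param_2];
//     ...
// }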

Related

NVIDIA __constant memory: how to populate constant memory from host in both OpenCL and CUDA?

I have a buffer (array) on the host that should reside in the constant memory region of the device (in this case, an NVIDIA GPU).
So, I have two questions:
How can I allocate a chunk of constant memory? Given the fact that I am tracing the available constant memory on the device and I know, for a fact, that we have that amount of memory available to us (at this time)
How can I initialize (populate) those arrays from values that are computed at runtime on the host?
I searched the web for this but could not find a concise document covering it. I would appreciate it if the provided examples were in both OpenCL and CUDA. The example for OpenCL is more important to me than the CUDA one.
How can I allocate a chunk of constant memory? Given the fact that I am tracing the available constant memory on the device and I know, for a fact, that we have that amount of memory available to us (at this time)
In CUDA, you can't. There is no runtime allocation of constant memory, only static definition of memory via the __constant__ specifier, which gets mapped to constant memory pages at assembly. You could generate some code containing such a static declaration at runtime and compile it via nvrtc, but that seems like a lot of effort for something you know can only be sized up to 64KB. It seems much simpler (to me at least) to just statically declare a 64KB constant buffer and use it at runtime as you see fit.
How can I initialize (populate) those arrays from values that are computed at runtime on the host?
As noted in comments, see here. The cudaMemcpyToSymbol API was created for this purpose and it works just like standard memcpy.
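Putting the two together, a minimal sketch of the "statically declare a 64KB buffer, fill it at runtime" approach might look like this (the buffer name and element count are just placeholders):

#include <vector>

// A fixed 64KB pool of constant memory, declared statically at compile time.
__constant__ float const_pool[16384];       // 16384 * 4 bytes = 64KB

__global__ void use_pool(float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = const_pool[i];      // read-only access from the kernel
}

int main()
{
    std::vector<float> host_values(16384);
    // ... compute host_values at runtime on the host ...

    // Populate the constant buffer from the host, memcpy-style.
    cudaMemcpyToSymbol(const_pool, host_values.data(),
                       host_values.size() * sizeof(float));

    // ... launch kernels that read const_pool ...
    return 0;
}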
Functionally, there is no difference between __constant in OpenCL and __constant__ in CUDA. The same limitations apply: static definition at compile time (which is runtime in the standard OpenCL execution model) and a 64KB limit.
For CUDA, I use the driver API and NVRTC and create the kernel string with a global constant array like this:
auto kernel = R"(
..
__constant__ ##Type## buffer[##SIZE##]={
##elm##
};
..
__global__ void test(int * input)
{ }
)";
then replace the ##-pattern placeholders with size and element value information at runtime, so it compiles as something like this:
__constant__ int buffer[16384]={ 1,2,3,4, ....., 16384 };
So it is run-time for the host but compile-time for the device. The downside is that the kernel string gets big and less readable, and connecting to other compilation units requires explicit linking (as if you were compiling a separate C++ project). But for simple calculations that only use your own implementations (no host definitions used directly), it is the same as the runtime API.
Since large strings require extra parsing time, you can cache the PTX intermediate data and also the binary generated from the PTX. Then you only need to check whether the kernel string has changed and needs to be re-compiled.
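A sketch of that compile-and-cache flow with NVRTC and the driver API is shown below (error checking omitted; the architecture option and the idea of keying the cache on a hash of the kernel string are assumptions, not requirements):

#include <nvrtc.h>
#include <cuda.h>
#include <string>

// Compile a runtime-generated kernel string to PTX. The returned PTX can be
// written to disk and reused as long as the kernel string has not changed.
std::string compile_to_ptx(const std::string& source)
{
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, source.c_str(), "generated.cu", 0, nullptr, nullptr);

    const char* opts[] = { "--gpu-architecture=compute_70" };  // illustrative arch
    nvrtcCompileProgram(prog, 1, opts);

    size_t ptx_size = 0;
    nvrtcGetPTXSize(prog, &ptx_size);
    std::string ptx(ptx_size, '\0');
    nvrtcGetPTX(prog, &ptx[0]);
    nvrtcDestroyProgram(&prog);
    return ptx;                 // cache this, e.g. keyed by a hash of `source`
}

// Later, load the cached PTX with the driver API:
//   CUmodule module;
//   cuModuleLoadData(&module, ptx.c_str());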
Are you sure __constant__ alone is worth the effort? Do you have benchmark results showing that it actually improves performance? (Premature optimization is the root of all evil.) Perhaps your algorithm works with register tiling and the source of the data does not matter?
Disclaimer: I cannot help you with CUDA.
For OpenCL, constant memory is effectively treated as read-only global memory from the programmer/API point of view, or defined inline in kernel source.
Define constant variables, arrays, etc. in your kernel code, like constant float DCT_C4 = 0.707106781f;. Note that you can dynamically generate kernel code on the host at runtime to generate derived constant data if you wish.
Pass constant memory from host to kernel via a buffer object, just as you would for global memory. Simply specify a pointer parameter in the constant memory region in your kernel function's prototype and set the buffer on the host side with clSetKernelArg(), for example:
kernel void mykernel(
    constant float* fixed_parameters,
    global const uint* dynamic_input_data,
    global uint* restrict output_data)
{
    // ...
}
and on the host side:
cl_mem fixed_parameter_buffer = clCreateBuffer(
    context,
    CL_MEM_READ_ONLY | CL_MEM_HOST_NO_ACCESS | CL_MEM_COPY_HOST_PTR,
    sizeof(cl_float) * num_fixed_parameters, fixed_parameter_data,
    NULL);
clSetKernelArg(mykernel, 0, sizeof(cl_mem), &fixed_parameter_buffer);
Make sure to take into account the value reported for CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE for the context being used! It usually doesn't help to use constant memory buffers for streaming input data; such data is better stored in global buffers, even if they are marked read-only for the kernel. constant memory is most useful for data that is used by a large proportion of work-items. There is typically a fairly tight size limitation on it, such as 64KiB; some implementations may "spill" to global memory if you try to exceed this, which will lose you any performance advantage you would gain from using constant memory.

Persistent buffers in CUDA

I have an application where I need to allocate and maintain a persistent buffer which can be used by successive launches of multiple kernels in CUDA. I will eventually need to copy the contents of this buffer back to the host.
I had the idea to declare a global scope device symbol which could be directly used in different kernels without being passed as an explicit kernel argument, something like
__device__ char* buffer;
but then I am uncertain how I should allocate memory and assign the address to this device pointer so that the memory has the persistent scope I require. So my question is really in two parts:
What is the lifetime of the various methods of allocating global memory?
How should I allocate memory and assign a value to the global scope pointer? Is it necessary to use device code malloc and run a setup kernel to do this, or can I use some combination of host side APIs to achieve this?
[Postscript: this question has been posted as a Q&A in response to this earlier SO question on a similar topic]
What is the lifetime of the various methods of allocating global memory?
All global memory allocations have a lifetime of the context in which they are allocated. This means that any global memory your application allocates is "persistent" by your definition, irrespective of whether you use host side APIs or device side allocation on the GPU runtime heap.
How should I allocate memory and assign a value to the global scope
pointer? Is it necessary to use device code malloc and run a setup
kernel to do this, or can I use some combination of host side APIs to
achieve this?
Either method will work as you require, although host APIs are much simpler to use. There are also some important differences between the two approaches.
Memory allocations using malloc or new in device code are allocated on a device runtime heap. This heap must be sized appropriately using the cudaDeviceSetLimit API before running malloc in device code, otherwise the call may fail. And the device heap is not accessible to host side memory management APIs, so you also require a copy kernel to transfer the memory contents to host API accessible memory before you can transfer the contents back to the host.
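For completeness, a rough sketch of the device-side route follows; the heap size and buffer size are placeholders:

__device__ char* buffer;

// Setup kernel: a single thread allocates from the device runtime heap.
__global__ void allocate_buffer(size_t sz)
{
    if (threadIdx.x == 0 && blockIdx.x == 0)
        buffer = (char*)malloc(sz);
}

int main()
{
    const size_t buffer_sz = 800 * 600;

    // Size the device heap *before* any kernel calls malloc.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 32 * 1024 * 1024);

    allocate_buffer<<<1, 1>>>(buffer_sz);
    cudaDeviceSynchronize();

    // ... work kernels use buffer; a copy kernel is still needed to move the
    // contents into cudaMalloc'd memory before cudaMemcpy back to the host.
    return 0;
}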
The host API case, on the other hand, is extremely straightforward and has none of the limitations of device side malloc. A simple example would look something like:
__device__ char* buffer;
int main()
{
char* d_buffer;
const size_t buffer_sz = 800 * 600 * sizeof(char);
// Allocate memory
cudaMalloc(&d_buffer, buffer_sz);
// Zero memory and assign to global device symbol
cudaMemset(d_buffer, 0, buffer_sz);
cudaMemcpyToSymbol(buffer, &d_buffer, sizeof(char*));
// Kernels go here using buffer
// copy to host
std::vector<char> results(800*600);
cudaMemcpy(&results[0], d_buffer, buffer_sz, cudaMemcpyDeviceToHost);
// buffer has lifespan until free'd here
cudaFree(d_buffer);
return 0;
};
[Standard disclaimer: code written in browser, not compiled or tested, use at own risk]
So basically you can achieve what you want with standard host side APIs: cudaMalloc, cudaMemcpyToSymbol, and cudaMemcpy. Nothing else is required.

cudaMalloc vs __device__ variables [duplicate]

I have a CUDA (v5.5) application that will need to use global memory. Ideally I would prefer to use constant memory, but I have exhausted constant memory and the overflow will have to be placed in global memory. I also have some variables that will need to be written to occasionally (after some reduction operations on the GPU) and I am placing this in global memory.
For reading, I will be accessing the global memory in a simple way. My kernel is called inside a for loop, and on each call of the kernel, every thread will access the exact same global memory addresses with no offsets. For writing, after each kernel call a reduction is performed on the GPU, and I have to write the results to global memory before the next iteration of my loop. There are far more reads from than writes to global memory in my application however.
My question is whether there are any advantages to using global memory declared in global (variable) scope over using dynamically allocated global memory? The amount of global memory that I need will change depending on the application, so dynamic allocation would be preferable for that reason. I know the upper limit on my global memory use however and I am more concerned with performance, so it is also possible that I could declare memory statically using a large fixed allocation that I am sure not to overflow. With performance in mind, is there any reason to prefer one form of global memory allocation over the other? Do they exist in the same physical place on the GPU and are they cached the same way, or is the cost of reading different for the two forms?
Global memory can be allocated statically (using __device__), dynamically (using device malloc or new) and via the CUDA runtime (e.g. using cudaMalloc).
All of the above methods allocate physically the same type of memory, i.e. memory carved out of the on-board (but not on-chip) DRAM subsystem. This memory has the same access, coalescing, and caching rules regardless of how it is allocated (and therefore has the same general performance considerations).
Since dynamic allocations take some non-zero time, there may be performance improvement for your code by doing the allocations once, at the beginning of your program, either using the static (i.e. __device__ ) method, or via the runtime API (i.e. cudaMalloc, etc.) This avoids taking the time to dynamically allocate memory during performance-sensitive areas of your code.
Also note that the 3 methods I outline, while having similar C/C++ -like access methods from device code, have differing access methods from the host. Statically allocated memory is accessed using the runtime API functions like cudaMemcpyToSymbol and cudaMemcpyFromSymbol, runtime API allocated memory is accessed via ordinary cudaMalloc / cudaMemcpy type functions, and dynamically allocated global memory (device new and malloc) is not directly accessible from the host.
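A minimal sketch of the host-side access difference between statically allocated and runtime API allocated memory (the array size is arbitrary):

__device__ float static_buf[256];           // statically allocated global memory

int main()
{
    float host_data[256] = {0};

    // Statically allocated: accessed by symbol.
    cudaMemcpyToSymbol(static_buf, host_data, sizeof(host_data));
    cudaMemcpyFromSymbol(host_data, static_buf, sizeof(host_data));

    // Runtime API allocated: accessed through an ordinary device pointer.
    float* dyn_buf;
    cudaMalloc(&dyn_buf, sizeof(host_data));
    cudaMemcpy(dyn_buf, host_data, sizeof(host_data), cudaMemcpyHostToDevice);
    cudaMemcpy(host_data, dyn_buf, sizeof(host_data), cudaMemcpyDeviceToHost);
    cudaFree(dyn_buf);
    return 0;
}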
First of all you need to think about coalescing your memory accesses. You didn't mention which GPU you are using. On recent GPUs, a coalesced memory read will give roughly the same performance as a constant memory read, so always make your memory reads and writes as coalesced as possible.
Another option is texture memory (if the data size fits into it). Texture memory has its own caching mechanism and was traditionally used when global memory reads were non-coalesced, but the latest GPUs give almost the same performance for texture and global memory.
I don't think globally declared memory gives any more performance than dynamically allocated global memory, since the coalescing issue still exists. Variables declared at global (program) scope, such as __constant__ memory variables and textures, do not need to be passed to kernels as arguments.
For memory optimizations, please see the memory optimization section in the CUDA C Best Practices Guide: http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#memory-optimizations

Utilizing Register memory in CUDA

I have some questions regarding CUDA register memory.
1) Is there any way to free registers in a CUDA kernel? I have variables and 1D and 2D arrays in registers (max array size 48).
2) If I use device functions, what happens to the registers used in a device function after its execution? Will they be available to the calling kernel or to other device functions?
3) How does nvcc optimize register usage? Please share the important points w.r.t. optimizing a memory-intensive kernel.
PS: I have a complex algorithm to port to CUDA which takes a lot of registers for computation. I am trying to figure out whether to store intermediate data in registers and write one kernel, or to store it in global memory and break the algorithm into multiple kernels.
Only local variables are eligible to reside in registers (see also Declaring Variables in a CUDA kernel). You don't have direct control over which variables (scalar or static array) will reside in registers. The compiler will make its own choices, balancing performance against register usage.
Register usage can be limited using the maxrregcount option of the nvcc compiler.
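Two common ways to cap register usage, shown as a hedged sketch (the numbers here are arbitrary and should really come from profiling):

// Per-kernel: __launch_bounds__ tells the compiler the maximum block size (and,
// optionally, a minimum number of resident blocks per SM) to optimize for, which
// constrains how many registers it will allocate per thread.
__global__ void __launch_bounds__(256, 4) heavy_kernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Whole compilation unit: limit registers per thread on the command line, e.g.
//   nvcc -maxrregcount=32 kernel.cu
// Values that no longer fit in registers are spilled to local memory, which may
// or may not be a net win.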
You can also put most small 1D and 2D arrays in shared memory or, if you are accessing constant data, put that content into constant memory (which is cached very close to the registers, like L1 cache content).
Another way of reducing register usage when dealing with compute-bound kernels in CUDA is to process data in stages, using multiple global kernel function calls and storing intermediate results in global memory. Each kernel will use far fewer registers, so that more active threads per SM will be able to hide load/store data movements. This technique, in combination with a proper usage of streams and asynchronous data transfers, is very successful most of the time.
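A skeletal example of that staging pattern (the kernel bodies are omitted; the point is just that a global-memory buffer carries intermediate results between launches):

__global__ void stage1(const float* in, float* intermediate, int n) { /* ... */ }
__global__ void stage2(const float* intermediate, float* out, int n) { /* ... */ }

void run_pipeline(const float* d_in, float* d_out, int n)
{
    float* d_intermediate;
    cudaMalloc(&d_intermediate, n * sizeof(float));

    // Each stage is a separate, register-lighter kernel; intermediate results
    // live in global memory between the two launches.
    stage1<<<(n + 255) / 256, 256>>>(d_in, d_intermediate, n);
    stage2<<<(n + 255) / 256, 256>>>(d_intermediate, d_out, n);

    cudaFree(d_intermediate);
}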
Regarding the use of device functions, I'm not sure, but I guess the registers' contents of the calling function will be moved/stored into local memory (L1 cache or so), in the same way register spilling occurs when using too many local variables (see CUDA Programming Guide -> Device Memory Accesses -> Local Memory). This operation will free up some registers for the called device function. After the device function completes, its local variables no longer exist, and the registers can be used again by the caller function and refilled with the previously saved content.
Keep in mind that small device functions defined in the same source file as the global kernel could be inlined by the compiler for performance reasons: when this happens, the resulting kernel will in general require more registers.

CUDA shared memory address space vs. global memory

To avoid really long and incohesive functions I am calling a number of device functions from a kernel. I allocate a shared buffer at the beginning of the kernel call (which is per-thread-block) and pass pointers to it to all the device functions that perform some processing steps in the kernel.
I was wondering about the following:
If I allocate a shared memory buffer in a global function, how can other device functions that I pass a pointer to distinguish between the possible address types (global device or shared memory) that the pointer could refer to?
Note that it is invalid to decorate the formal parameters with a shared modifier according to the CUDA programming guide. The only way, imho, it could be implemented is by
a) putting markers on the allocated memory,
b) passing invisible parameters with the call, or
c) having a virtual unified address space that has separate segments for global and shared memory, so that a threshold check on the pointer can be used.
So my question is: do I need to worry about it, or how should one proceed alternatively without inlining all functions into the main kernel?
===========================================================================================
On the side, I was today horrified that NVCC with CUDA Toolkit 3.0 disallows so-called 'external calls from global functions', requiring them to be inlined. This means in effect I have to declare all device functions inline, and the separation of header/source files is broken. This is of course quite ugly, but is there an alternative?
If I allocate a shared memory buffer in a global function how can other device functions that I pass a pointer distinguish between the possible address types (global device or shared mem) that the pointer could refer to.
Note that "shared" memory, in the context of CUDA, specifically means the on-chip memory that is shared between all threads in a block. So, if you mean an array declared with the __shared__ qualifier, it normally doesn't make sense to use it for passing information between device functions (as all the threads see the very same memory). I think the compiler might put regular arrays in shared memory? Or maybe it was in the register file. Anyway, there's a good chance that it ends up in global memory, which would be an inefficient way of passing information between the device functions (especially on < 2.0 devices).
On the side I was today horrified that NVCC with CUDA Toolkit 3.0 disallows so-called 'external calls from global functions', requiring them to be inlined. This means in effect I have to declare all device functions inline and the separation of header / source files is broken. This is of course quite ugly, but is there an alternative?
CUDA does not include a linker for device code so you must keep the kernel(s) and all related device functions in the same .cu file.
This depends on the compute capability of your CUDA device. For devices of compute capability <2.0, the compiler has to decide at compile time whether a pointer points to shared or global memory and issue separate instructions. This is not required for devices with compute capability >= 2.0.
By default, all function calls within a kernel are inlined and the compiler can then, in most cases, use flow analysis to see if something is shared or global. If you're compiling for a device of compute capability <2.0, you may have encountered the warning warning : Cannot tell what pointer points to, assuming global memory space. This is what you get when the compiler can't follow your pointers around correctly.