Persistent buffers in CUDA - cuda

I have an application where I need to allocate and maintain a persistent buffer which can be used by successive launches of multiple kernels in CUDA. I will eventually need to copy the contents of this buffer back to the host.
I had the idea to declare a global scope device symbol which could be directly used in different kernels without being passed as an explicit kernel argument, something like
__device__ char* buffer;
but then I am uncertain how I should allocate memory and assign the address to this device pointer so that the memory has the persistent scope I require. So my question is really in two parts:
What is the lifetime of the various methods of allocating global memory?
How should I allocate memory and assign a value to the global scope pointer? Is it necessary to use device code malloc and run a setup kernel to do this, or can I use some combination of host side APIs to achieve this?
[Postscript: this question has been posted as a Q&A in response to this earlier SO question on a similar topic]

What is the lifetime of the various methods of allocating global memory?
All global memory allocations have a lifetime of the context in which they are allocated. This means that any global memory your applications allocates is "persistent" by your definition, irrespective of whether you use host side APIs or device side allocation on the GPU runtime heap.
How should I allocate memory and assign a value to the global scope
pointer? Is it necessary to use device code malloc and run a setup
kernel to do this, or can I use some combination of host side APIs to
achieve this?
Either method will work as you require, although host APIs are much simpler to use. There are also some important differences between the two approaches.
Memory allocations using malloc or new in device code are allocated on a device runtime heap. This heap must be sized appropriately using the cudaDeviceSetLimit API before running mallocin device code, otherwise the call may fail. And the device heap is not accessible to host side memory management APIs , so you also require a copy kernel to transfer the memory contents to host API accessible memory before you can transfer the contents back to the host.
The host API case, on the other hand, is extremely straightforward and has none of the limitations of device side malloc. A simple example would look something like:
__device__ char* buffer;
int main()
{
char* d_buffer;
const size_t buffer_sz = 800 * 600 * sizeof(char);
// Allocate memory
cudaMalloc(&d_buffer, buffer_sz);
// Zero memory and assign to global device symbol
cudaMemset(d_buffer, 0, buffer_sz);
cudaMemcpyToSymbol(buffer, &d_buffer, sizeof(char*));
// Kernels go here using buffer
// copy to host
std::vector<char> results(800*600);
cudaMemcpy(&results[0], d_buffer, buffer_sz, cudaMemcpyDeviceToHost);
// buffer has lifespan until free'd here
cudaFree(d_buffer);
return 0;
};
[Standard disclaimer: code written in browser, not compiled or tested, use at own risk]
So basically you can achieve what you want with standard host side APIs: cudaMalloc, cudaMemcpyToSymbol, and cudaMemcpy. Nothing else is required.

Related

NVIDIA __constant memory: how to populate constant memory from host in both OpenCL and CUDA?

I have a buffer (array) on the host that should be resided in the constant memory region of the device (in this case, an NVIDIA GPU).
So, I have two questions:
How can I allocate a chunk of constant memory? Given the fact that I am tracing the available constant memory on the device and I know, for a fact, that we have that amount of memory available to us (at this time)
How can I initialize (populate) those arrays from values that are computed at the run time on the host?
I searched the web for this but there is no concise document documenting this. I would appreciate it if provided examples would be in both OpenCL and CUDA. The example for OpenCL is more important to me than CUDA.
How can I allocate a chunk of constant memory? Given the fact that I am tracing the available constant memory on the device and I know, for a fact, that we have that amount of memory available to us (at this time)
In CUDA, you can't. There is no runtime allocation of constant memory, only static definition of memory via the __constant__ specifier which get mapped to constant memory pages at assembly. You could generate some code contain such a static declaration at runtime and compile it via nvrtc, but that seems like a lot of effort for something you know can only be sized up to 64kb. It seems much simpler (to me at least) to just statically declare a 64kb constant buffer and use it at runtime as you see fit.
How can I initialize (populate) those arrays from values that are computed at the runtime on the host?
As noted in comments, see here. The cudaMemcpyToSymbol API was created for this purpose and it works just like standard memcpy.
Functionally, there is no difference between __constant in OpenCL and __constant__ in CUDA. The same limitations apply: static definition at compile time (which is runtime in the standard OpenCL execution model), 64kb limit.
For cuda, I use driver API and NVRTC and create kernel string with a global constant array like this:
auto kernel = R"(
..
__constant__ ##Type## buffer[##SIZE##]={
##elm##
};
..
__global__ void test(int * input)
{ }
)";
then replace ##-pattern words with size and element value information in run-time and compile like this:
__constant__ int buffer[16384]={ 1,2,3,4, ....., 16384 };
So, it is run-time for the host, compile-time for the device. Downside is that the kernel string gets too big, has less readability and connecting classes needs explicitly linking (as if you are compiling a side C++ project) other compilation units. But for simple calculations with only your own implementations (no host-definitions used directly), it is same as runtime API.
Since large strings require extra parsing time, you can cache the ptx intermediate data and also cache the binary generated from ptx. Then you can check if kernel string has changed and needs to be re-compiled.
Are you sure just __constant__ worths the effort? Do you have some benchmark results to show that actually improves performance? (premature optimization is source of all evil). Perhaps your algorithm works with register-tiling and the source of data does not matter?
Disclaimer: I cannot help you with CUDA.
For OpenCL, constant memory is effectively treated as read-only global memory from the programmer/API point of view, or defined inline in kernel source.
Define constant variables, arrays, etc. in your kernel code, like constant float DCT_C4 = 0.707106781f;. Note that you can dynamically generate kernel code on the host at runtime to generate derived constant data if you wish.
Pass constant memory from host to kernel via a buffer object, just as you would for global memory. Simply specify a pointer parameter in the constant memory region in your kernel function's prototype and set the buffer on the host side with clSetKernelArg(), for example:
kernel void mykernel(
constant float* fixed_parameters,
global const uint* dynamic_input_data,
global uint* restrict output_data)
{
cl_mem fixed_parameter_buffer = clCreateBuffer(
cl_context,
CL_MEM_READ_ONLY | CL_MEM_HOST_NO_ACCESS | CL_MEM_COPY_HOST_PTR,
sizeof(cl_float) * num_fixed_parameters, fixed_parameter_data,
NULL);
clSetKernelArg(mykernel, 0, sizeof(cl_mem), &fixed_parameter_buffer);
Make sure to take into account the value reported for CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE for the context being used! It usually doesn't help to use constant memory buffers for streaming input data, this is better stored in global buffers, even if they are marked read-only for the kernel. constant memory is most useful for data that are used by a large proportion of work-items. There is typically a fairly tight size limitation such as 64KiB on it - some implementations may "spill" to global memory if you try to exceed this, which will lose you any performance advantages you would gain from using constant memory.

cudamalloc vs __device__ variables [duplicate]

I have a CUDA (v5.5) application that will need to use global memory. Ideally I would prefer to use constant memory, but I have exhausted constant memory and the overflow will have to be placed in global memory. I also have some variables that will need to be written to occasionally (after some reduction operations on the GPU) and I am placing this in global memory.
For reading, I will be accessing the global memory in a simple way. My kernel is called inside a for loop, and on each call of the kernel, every thread will access the exact same global memory addresses with no offsets. For writing, after each kernel call a reduction is performed on the GPU, and I have to write the results to global memory before the next iteration of my loop. There are far more reads from than writes to global memory in my application however.
My question is whether there are any advantages to using global memory declared in global (variable) scope over using dynamically allocated global memory? The amount of global memory that I need will change depending on the application, so dynamic allocation would be preferable for that reason. I know the upper limit on my global memory use however and I am more concerned with performance, so it is also possible that I could declare memory statically using a large fixed allocation that I am sure not to overflow. With performance in mind, is there any reason to prefer one form of global memory allocation over the other? Do they exist in the same physical place on the GPU and are they cached the same way, or is the cost of reading different for the two forms?
Global memory can be allocated statically (using __device__), dynamically (using device malloc or new) and via the CUDA runtime (e.g. using cudaMalloc).
All of the above methods allocate physically the same type of memory, i.e. memory carved out of the on-board (but not on-chip) DRAM subsystem. This memory has the same access, coalescing, and caching rules regardless of how it is allocated (and therefore has the same general performance considerations).
Since dynamic allocations take some non-zero time, there may be performance improvement for your code by doing the allocations once, at the beginning of your program, either using the static (i.e. __device__ ) method, or via the runtime API (i.e. cudaMalloc, etc.) This avoids taking the time to dynamically allocate memory during performance-sensitive areas of your code.
Also note that the 3 methods I outline, while having similar C/C++ -like access methods from device code, have differing access methods from the host. Statically allocated memory is accessed using the runtime API functions like cudaMemcpyToSymbol and cudaMemcpyFromSymbol, runtime API allocated memory is accessed via ordinary cudaMalloc / cudaMemcpy type functions, and dynamically allocated global memory (device new and malloc) is not directly accessible from the host.
First of all you need to think of coalescing the memory access. You didn't mention about the GPU you are using. In the latest GPUs, the coal laced memory read will give same performance as that of constant memory. So always make your memory read and write in coal laced manner as possible as you can.
Another you can use texture memory (If the data size fits into it). This texture memory has some caching mechanism. This is previously used in case when global memory read was non-coalesced. But latest GPUs give almost same performance for texture and global memory.
I don't think the globally declared memory give more performance over dynamically allocated global memory, since the coalescing issue still exists. Also global memory declared in global (variable) scope is not possible in case of CUDA global memory. The variables that can declare globally (in the program) are constant memory variables and texture, which we doesn't required to pass to kernels as arguments.
for memory optimizations please see the memory optimization section in cuda c best practices guide http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#memory-optimizations

Why do I need to declare CUDA variables on the Host before allocating them on the Device

I've just started trying to learn CUDA again and came across some code I don't fully understand.
// declare GPU memory pointers
float * d_in;
float * d_out;
// allocate GPU memory
cudaMalloc((void**) &d_in, ARRAY_BYTES);
cudaMalloc((void**) &d_out, ARRAY_BYTES);
When the GPU memory pointers are declared, they allocate memory on the host. The cudaMalloc calls throw away the information that d_in and d_out are pointers to floats.
I can't think why cudaMalloc would need to know about where in host memory d_in & d_out have originally been stored. It's not even clear why I need to use the host bytes to store whatever host address d_in & d_out point to.
So, what is the purpose of the original variable declarations on the host?
======================================================================
I would've thought something like this would make more sense:
// declare GPU memory pointers
cudaFloat * d_in;
cudaFloat * d_out;
// allocate GPU memory
cudaMalloc((void**) &d_in, ARRAY_BYTES);
cudaMalloc((void**) &d_out, ARRAY_BYTES);
This way, everything GPU related takes place on the GPU. If d_in or d_out are accidentally used in host code, an error can be thrown at compile time, since those variables wouldn't be defined on the host.
I guess what I also find confusing is that by storing device memory addresses on the host, it feels like the device isn't in fully in charge of managing its own memory. It feels like there's a risk of host code accidentally overwriting the value of either d_in or d_out either through accidentally assigning to them in host code or another more subtle error, which could cause the GPU to lose access to its own memory. Also, it seems strange that the addresses assigned to d_in & d_out are chosen by the host, instead of the device. Why should the host know anything about which addresses are/are not available on the device?
What am I failing to understand here?
I can't think why cudaMalloc would need to know about where in host memory d_in & d_out have originally been stored
That is just the C pass by reference idiom.
It's not even clear why I need to use the host bytes to store whatever host address d_in & d_out point to.
Ok, so let's design the API your way. Here is a typical sequence of operations on the host -- allocate some memory on the device, copy some data to that memory, launch a kernel to do something to that memory. You can think for yourself how it would be possible to do this without having the pointers to the allocated memory stored in a host variable:
cudaMalloc(somebytes);
cudaMemcpy(?????, hostdata, somebytes, cudaMemcpyHOstToDevice);
kernel<<<1,1>>>(?????);
If you can explain what should be done with ????? if we don't have the address of the memory allocation on the device stored in a host variable, then you are really onto something. If you can't, then you have deduced the basic reason why we store the return address of memory allocated on the GPU in host variables.
Further, because of the use of typed host pointers to store the addresses of device allocations, CUDA runtime API can do type checking. So this:
__global__ void kernel(double *data, int N);
// .....
int N = 1 << 20;
float * d_data;
cudaMalloc((void **)&d_data, N * sizeof(float));
kernel<<<1,1>>>(d_data, N);
can report type mismatches at compile time, which is very useful.
Your fundamental conceptual failure is mixing up host-side code and device-side code. If you call cudaMalloc() from code execution on the CPU, then, well, it's on the CPU: It's you who want to have the arguments in CPU memory, and the result in CPU memory. You asked for it. cudaMalloc has told the GPU/device how much of its (the device's) memory to allocate, but if the CPU/host wants to access that memory, it needs a way to refer to it that the device will understand. The memory location on the device is a way to do this.
Alternatively, you can call it from device-side code; then everything takes place on the GPU. (Although, frankly, I've never done it myself and it's not such a great idea except in special cases).

Utilizing Register memory in CUDA

I have some questions regarding cuda registers memory
1) Is there any way to free registers in cuda kernel? I have variables, 1D and 2D arrays in registers. (max array size 48)
2) If I use device functions, then what happens to the registers I used in device function after its execution? Will they be available for calling kernel execution or other device functions?
3) How nvcc optimizes register usage? Please share the points important w.r.t optimization of memory intensive kernel
PS: I have a complex algorithm to port to cuda which is taking a lot of registers for computation, I am trying to figure out whether to store intermediate data in register and write one kernel or store it in global memory and break algorithm in multiple kernels.
Only local variables are eligible of residing in registers (see also Declaring Variables in a CUDA kernel). You don't have direct control on which variables (scalar or static array) will reside in registers. The compiler will make it's own choices, striving for performance with respected to register sparing.
Register usage can be limited using the maxrregcount options of nvcc compiler.
You can also put most small 1D, 2D arrays in shared memory or, if accessing to constant data, put this content into constant memory (which is cached very close to register as L1 cache content).
Another way of reducing register usage when dealing with compute bound kernels in CUDA is to process data in stages, using multiple global kernel functions calls and storing intermediate results into global memory. Each kernel will use far less registers so that more active threads per SM will be able to hide load/store data movements. This technique, in combination with a proper usage of streams and asynchronous data transfers is very successful most of the time.
Regarding the use of device function, I'm not sure, but I guess registers's content of the calling function will be moved/stored into local memory (L1 cache or so), in the same way as register spilling occurs when using too many local variables (see CUDA Programming Guide -> Device Memory Accesses -> Local Memory). This operation will free up some registers for the called device function. After the device function is completed, their local variables exist no more, and registers can be now used again by the caller function and filled with the previously saved content.
Keep in mind that small device functions defined in the same source code of the global kernel could be inlined by the compiler for performance reasons: when this happen, the resulting kernel will in general require more registers.

CUDA shared memory address space vs. global memory

To avoid really long and incohesive functions I am calling
a number of device functions from a kernel. I allocate a shared
buffer at the beginning of the kernel call (which is per-thread-block)
and pass pointers to it to all the device functions that are
performing some processing steps in the kernel.
I was wondering about the following:
If I allocate a shared memory buffer in a global function
how can other device functions that I pass a pointer distinguish
between the possible address types (global device or shared mem) that
the pointer could refer to.
Note it is invalid to decorate the formal parameters with a shared modifier
according to the 'CUDA programming guide'. The only way imhoit could be
implemented is
a) by putting markers on the allocated memory
b) passing invisible parameters with the call.
c) having a virtual unified address space that has separate segments for
global and shared memory and a threshold check on the pointer can be used?
So my question is: Do I need to worry about it or how should one proceed alternatively
without inlining all functions into the main kernel?
===========================================================================================
On the side I was today horrified that NVCC with CUDA Toolkit 3.0 disallows so-called
'external calls from global functions', requiring them to be inlined. This means in effect
I have to declare all device functions inline and the separation of header / source
files is broken. This is of course quite ugly, but is there an alternative?
If I allocate a shared memory buffer in a global function how can other device functions that I pass a pointer distinguish between the possible address types (global device or shared mem) that the pointer could refer to.
Note that "shared" memory, in the context of CUDA, specifically means the on-chip memory that is shared between all threads in a block. So, if you mean an array declared with the __shared__ qualifier, it normally doesn't make sense to use it for passing information between device functions (as all the threads see the very same memory). I think the compiler might put regular arrays in shared memory? Or maybe it was in the register file. Anyway, there's a good chance that it ends up in global memory, which would be an inefficient way of passing information between the device functions (especially on < 2.0 devices).
On the side I was today horrified that NVCC with CUDA Toolkit 3.0 disallows so-called 'external calls from global functions', requiring them to be inlined. This means in effect I have to declare all device functions inline and the separation of header / source files is broken. This is of course quite ugly, but is there an alternative?
CUDA does not include a linker for device code so you must keep the kernel(s) and all related device functions in the same .cu file.
This depends on the compute capability of your CUDA device. For devices of compute capability <2.0, the compiler has to decide at compile time whether a pointer points to shared or global memory and issue separate instructions. This is not required for devices with compute capability >= 2.0.
By default, all function calls within a kernel are inlined and the compiler can then, in most cases, use flow analysis to see if something is shared or global. If you're compiling for a device of compute capability <2.0, you may have encountered the warning warning : Cannot tell what pointer points to, assuming global memory space. This is what you get when the compiler can't follow your pointers around correctly.