__device__ __constant__ const - cuda

Is there any difference, and what is the best way to define device constants in a CUDA program? In a C++ host/device program, if I want to define constants that reside in device constant memory, I can do either
__device__ __constant__ float a = 5;
__constant__ float a = 5;
Question 1. On devices of compute capability 2.x with CUDA 4, is it the same as
__device__ const float a = 5;
Question 2. Why is it that in PyCUDA SourceModule("""..."""), which compiles only device code, even the following works?
const float a = 5;

In CUDA, __constant__ is a variable type qualifier that indicates the variable being declared is to be stored in device constant memory. Quoting section B.2.2 of the CUDA programming guide:
The __constant__ qualifier, optionally used together with __device__, declares a variable that:
Resides in constant memory space,
Has the lifetime of an application,
Is accessible from all the threads within the grid and from the host through the runtime library (cudaGetSymbolAddress() / cudaGetSymbolSize() / cudaMemcpyToSymbol() / cudaMemcpyFromSymbol() for the runtime API and cuModuleGetGlobal() for the driver API).
In CUDA, constant memory is a dedicated, static, global memory area accessed via a cache (there is a dedicated set of PTX load instructions for the purpose), which is uniform and read-only for all threads in a running kernel. The contents of constant memory can, however, be modified at runtime through the host-side APIs quoted above. This is different from declaring a variable to the compiler with the const qualifier, which merely adds a read-only characteristic to the variable at the scope of the declaration. The two are not at all the same thing.
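As a rough illustration of that difference (the variable and kernel names here are made up for the example, not taken from the question): a __constant__ variable lives in device constant memory and can be overwritten from the host via the symbol APIs, while a const qualifier is purely a compile-time, scope-level restriction.
__constant__ float coeff;              // lives in device constant memory, visible to all kernels

__global__ void scale(float* data)
{
    data[threadIdx.x] *= coeff;        // read through the constant cache
}

int main()
{
    float h_coeff = 5.0f;              // value chosen at runtime on the host
    cudaMemcpyToSymbol(coeff, &h_coeff, sizeof(float));
    // scale<<<1, 128>>>(d_data);      // kernels launched now see coeff == 5.0f

    const float a = 5.0f;              // by contrast: just a read-only local in this scope,
    (void)a;                           // nothing to do with constant memory
    return 0;
}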

Related

NVIDIA __constant memory: how to populate constant memory from host in both OpenCL and CUDA?

I have a buffer (array) on the host that should reside in the constant memory region of the device (in this case, an NVIDIA GPU).
So, I have two questions:
How can I allocate a chunk of constant memory? Given that I am tracking the available constant memory on the device, I know for a fact that we have that amount of memory available to us (at this time).
How can I initialize (populate) those arrays from values that are computed at runtime on the host?
I searched the web for this, but there is no concise document covering it. I would appreciate it if the provided examples were in both OpenCL and CUDA. The OpenCL example is more important to me than the CUDA one.
How can I allocate a chunk of constant memory? Given that I am tracking the available constant memory on the device, I know for a fact that we have that amount of memory available to us (at this time).
In CUDA, you can't. There is no runtime allocation of constant memory, only static definition of memory via the __constant__ specifier, which gets mapped to constant memory pages at assembly. You could generate code containing such a static declaration at runtime and compile it via nvrtc, but that seems like a lot of effort for something you know can only be sized up to 64KB. It seems much simpler (to me at least) to just statically declare a 64KB constant buffer and use it at runtime as you see fit.
How can I initialize (populate) those arrays from values that are computed at runtime on the host?
As noted in comments, see here. The cudaMemcpyToSymbol API was created for this purpose and it works just like standard memcpy.
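A minimal sketch of that pattern (a statically declared, maximum-size buffer that you fill only partially at runtime; the names and sizes here are illustrative):
// 16384 floats * 4 bytes = 64KB of constant memory, declared statically
__constant__ float const_pool[16384];

__global__ void use_pool(float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = const_pool[i];
}

// Host side: upload however many runtime-computed values you actually have
// (count must not exceed the 16384 elements declared above).
void upload(const float* host_values, size_t count)
{
    cudaMemcpyToSymbol(const_pool, host_values, count * sizeof(float));
}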
Functionally, there is no difference between __constant in OpenCL and __constant__ in CUDA. The same limitations apply: static definition at compile time (which is runtime in the standard OpenCL execution model) and a 64KB size limit.
For CUDA, I use the driver API and NVRTC and create the kernel string with a global constant array like this:
auto kernel = R"(
..
__constant__ ##Type## buffer[##SIZE##]={
##elm##
};
..
__global__ void test(int * input)
{ }
)";
then replace the ##-pattern placeholders with the size and element values at runtime, so that the compiled source contains something like this:
__constant__ int buffer[16384]={ 1,2,3,4, ....., 16384 };
So it is runtime for the host, but compile time for the device. The downside is that the kernel string gets very large, is less readable, and linking against other compilation units has to be done explicitly (as if you were building a separate C++ project). But for simple calculations that use only your own implementations (no host definitions used directly), it is the same as the runtime API.
Since large strings add parsing time, you can cache the intermediate PTX and the binary generated from the PTX, and then recompile only when the kernel string has changed.
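A rough sketch of the NVRTC plus driver API flow described above (the placeholder substitution is simplified to the array size only, and error checking is omitted for brevity):
#include <cuda.h>
#include <nvrtc.h>
#include <string>
#include <vector>

int main()
{
    // Template with a placeholder that is filled in at host runtime
    std::string src = R"(
    __constant__ int buffer[##SIZE##];
    __global__ void test(int* out) { out[0] = buffer[0]; }
    )";
    src.replace(src.find("##SIZE##"), 8, "16384");

    // Compile to PTX with NVRTC
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src.c_str(), "generated.cu", 0, nullptr, nullptr);
    nvrtcCompileProgram(prog, 0, nullptr);
    size_t ptx_size;
    nvrtcGetPTXSize(prog, &ptx_size);
    std::vector<char> ptx(ptx_size);
    nvrtcGetPTX(prog, ptx.data());              // this PTX is what you would cache to disk
    nvrtcDestroyProgram(&prog);

    // Load through the driver API
    cuInit(0);
    CUdevice dev;  cuDeviceGet(&dev, 0);
    CUcontext ctx; cuCtxCreate(&ctx, 0, dev);
    CUmodule mod;  cuModuleLoadDataEx(&mod, ptx.data(), 0, nullptr, nullptr);
    CUfunction fn; cuModuleGetFunction(&fn, mod, "test");
    // cuLaunchKernel(fn, ...) as usual; cuModuleGetGlobal() can also fetch 'buffer'

    cuCtxDestroy(ctx);
    return 0;
}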
Are you sure __constant__ alone is worth the effort? Do you have benchmark results showing that it actually improves performance? (Premature optimization is the root of all evil.) Perhaps your algorithm works with register tiling and the source of the data does not matter.
Disclaimer: I cannot help you with CUDA.
For OpenCL, constant memory is effectively treated as read-only global memory from the programmer/API point of view, or defined inline in kernel source.
Define constant variables, arrays, etc. in your kernel code, like constant float DCT_C4 = 0.707106781f;. Note that you can dynamically generate kernel code on the host at runtime to generate derived constant data if you wish.
Pass constant memory from the host to the kernel via a buffer object, just as you would for global memory. Simply declare a pointer parameter in the constant address space in your kernel function's prototype and set the buffer on the host side with clSetKernelArg(), for example:
kernel void mykernel(
    constant float* fixed_parameters,
    global const uint* dynamic_input_data,
    global uint* restrict output_data)
{
    // ...
}
and on the host side:
cl_mem fixed_parameter_buffer = clCreateBuffer(
    cl_context,
    CL_MEM_READ_ONLY | CL_MEM_HOST_NO_ACCESS | CL_MEM_COPY_HOST_PTR,
    sizeof(cl_float) * num_fixed_parameters, fixed_parameter_data,
    NULL);
clSetKernelArg(mykernel, 0, sizeof(cl_mem), &fixed_parameter_buffer);
Make sure to take into account the value reported for CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE for the device(s) in the context being used! It usually doesn't help to use constant memory buffers for streaming input data; that is better stored in global buffers, even if they are marked read-only for the kernel. constant memory is most useful for data that is used by a large proportion of work-items. There is typically a fairly tight size limit, such as 64KiB; some implementations may "spill" to global memory if you try to exceed it, which will lose you any performance advantage you would gain from using constant memory.
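A small helper for checking that limit on the host before deciding what to route through constant parameters might look like this (a sketch; it assumes you already hold a valid cl_device_id):
#include <CL/cl.h>

// Returns the device's constant buffer limit in bytes (0 on failure)
static cl_ulong max_constant_buffer_size(cl_device_id device)
{
    cl_ulong bytes = 0;
    if (clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE,
                        sizeof(bytes), &bytes, NULL) != CL_SUCCESS)
        return 0;
    return bytes;
}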

Persistent buffers in CUDA

I have an application where I need to allocate and maintain a persistent buffer which can be used by successive launches of multiple kernels in CUDA. I will eventually need to copy the contents of this buffer back to the host.
I had the idea to declare a global scope device symbol which could be directly used in different kernels without being passed as an explicit kernel argument, something like
__device__ char* buffer;
but then I am uncertain how I should allocate memory and assign the address to this device pointer so that the memory has the persistent scope I require. So my question is really in two parts:
What is the lifetime of the various methods of allocating global memory?
How should I allocate memory and assign a value to the global scope pointer? Is it necessary to use device code malloc and run a setup kernel to do this, or can I use some combination of host side APIs to achieve this?
[Postscript: this question has been posted as a Q&A in response to this earlier SO question on a similar topic]
What is the lifetime of the various methods of allocating global memory?
All global memory allocations have the lifetime of the context in which they are allocated. This means that any global memory your application allocates is "persistent" by your definition, irrespective of whether you use host-side APIs or device-side allocation on the GPU runtime heap.
How should I allocate memory and assign a value to the global scope
pointer? Is it necessary to use device code malloc and run a setup
kernel to do this, or can I use some combination of host side APIs to
achieve this?
Either method will work as you require, although host APIs are much simpler to use. There are also some important differences between the two approaches.
Memory allocated with malloc or new in device code comes from a device runtime heap. This heap must be sized appropriately using the cudaDeviceSetLimit API before running malloc in device code, otherwise the call may fail. And the device heap is not accessible to the host-side memory management APIs, so you also require a copy kernel to transfer the memory contents to host-API-accessible memory before you can transfer them back to the host.
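For completeness, a sketch of what the device-side route would look like (heap sizing, a setup kernel, and the extra copy step it implies; the sizes here are illustrative):
__device__ char* buffer;

__global__ void setup_buffer(size_t n)
{
    // A single thread allocates from the device runtime heap and publishes the pointer
    if (threadIdx.x == 0 && blockIdx.x == 0)
        buffer = static_cast<char*>(malloc(n));
}

int main()
{
    const size_t buffer_sz = 800 * 600;
    // Size the device heap *before* any kernel calls malloc
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 8 * buffer_sz);
    setup_buffer<<<1, 1>>>(buffer_sz);
    cudaDeviceSynchronize();
    // ... kernels using buffer go here ...
    // Note: this allocation is not visible to cudaMemcpy; a copy kernel into a
    // cudaMalloc'ed buffer is required before the data can be moved back to the host.
    return 0;
}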
The host API case, on the other hand, is extremely straightforward and has none of the limitations of device side malloc. A simple example would look something like:
__device__ char* buffer;
int main()
{
char* d_buffer;
const size_t buffer_sz = 800 * 600 * sizeof(char);
// Allocate memory
cudaMalloc(&d_buffer, buffer_sz);
// Zero memory and assign to global device symbol
cudaMemset(d_buffer, 0, buffer_sz);
cudaMemcpyToSymbol(buffer, &d_buffer, sizeof(char*));
// Kernels go here using buffer
// copy to host
std::vector<char> results(800*600);
cudaMemcpy(&results[0], d_buffer, buffer_sz, cudaMemcpyDeviceToHost);
// buffer has lifespan until free'd here
cudaFree(d_buffer);
return 0;
}
[Standard disclaimer: code written in browser, not compiled or tested, use at own risk]
So basically you can achieve what you want with standard host side APIs: cudaMalloc, cudaMemcpyToSymbol, and cudaMemcpy. Nothing else is required.

When passing parameter by value to kernel function, where are parameters copied?

I'm beginner at CUDA programming and have a question.
When I pass parameters by value, like this:
__global__ void add(int a, int b, int *c) {
// some operations
}
Since variables a and b are passed to the kernel function add by value, I guessed some memory space would be needed to hold the copies.
If I'm right, is that additional memory space where the parameters are copied on the GPU or in the host's main memory?
The reason I ask is that I need to pass a big struct to a kernel function.
I also thought of passing a pointer to the struct, but that way seems to require calling cudaMalloc for the struct and for each of its member variables.
The very short answer is that all arguments to CUDA kernels are passed by value, and those arguments are copied by the host via an API into a dedicated memory argument buffer on the GPU. At present, this buffer is stored in constant memory and there is a limit of 4KB of arguments per kernel launch -- see here.
In more detail, the PTX standard (technically, since compute capability 2.0 hardware and the CUDA ABI appeared) defines a dedicated logical state space called .param which holds kernel and device parameter arguments. See here. Quoting from that documentation:
Each kernel function definition includes an optional list of
parameters. These parameters are addressable, read-only variables
declared in the .param state space. Values passed from the host to the
kernel are accessed through these parameter variables using ld.param
instructions. The kernel parameter variables are shared across all
CTAs within a grid.
It further notes that:
Note: The location of parameter space is implementation specific. For example, in some implementations kernel parameters reside in
global memory. No access protection is provided between parameter and
global space in this case. Similarly, function parameters are mapped
to parameter passing registers and/or stack locations based on the
function calling conventions of the Application Binary Interface
(ABI).
So the precise location of the parameter state space is implementation specific. In the first iteration of CUDA hardware, it mapped to shared memory for kernel arguments and to registers for device function arguments. However, since compute capability 2.0 hardware and the PTX 2.2 standard, it maps to constant memory for kernels under most circumstances. The documentation says the following on the matter:
The constant (.const) state space is a read-only memory initialized
by the host. Constant memory is accessed with a ld.const
instruction. Constant memory is restricted in size, currently limited
to 64 KB which can be used to hold statically-sized constant
variables. There is an additional 640 KB of constant memory,
organized as ten independent 64 KB regions. The driver may allocate
and initialize constant buffers in these regions and pass pointers to
the buffers as kernel function parameters. Since the ten regions are
not contiguous, the driver must ensure that constant buffers are
allocated so that each buffer fits entirely within a 64 KB region and
does not span a region boundary.
Statically-sized constant variables have an optional variable initializer; constant variables with no explicit initializer are
initialized to zero by default. Constant buffers allocated by the
driver are initialized by the host, and pointers to such buffers are
passed to the kernel as parameters.
[Emphasis mine]
So while kernel arguments are stored in constant memory, this is not the same constant memory which maps to the .const state space accessible by defining a variable as __constant__ in CUDA C or the equivalent in Fortran or Python. Rather, it is an internal pool of device memory managed by the driver and not directly accessible to the programmer.
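To connect this back to the question about passing a big struct: as long as the struct fits within the per-launch argument limit, it can simply be passed by value, and the copy ends up in that driver-managed constant bank; no cudaMalloc is needed for the struct itself. A sketch (the struct and kernel here are invented for illustration):
struct FilterParams {
    float gain;
    int   width, height;
    float taps[256];                   // about 1KB of coefficients, well under the limit
};

__global__ void apply(FilterParams p, const float* in, float* out)
{
    // p is read via the .param state space; every thread sees the same copy
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < p.width * p.height)
        out[i] = p.gain * in[i] * p.taps[i % 256];
}

// Launch: the whole struct is copied by value at launch time, e.g.
// FilterParams hp = { /* runtime values */ };
// apply<<<grid, block>>>(hp, d_in, d_out);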

Can my kernel code tell how much shared memory it has available?

Is it possible for running device-side CUDA code to know how much (static and/or dynamic) shared memory is allocated to each block of the running kernel's grid?
On the host side, you know how much shared memory a launched kernel had (or will have), since you set that value yourself; but what about the device side? It's easy to compile the upper bound on that size into the kernel, but that information is not available (unless passed explicitly) to the device. Is there an on-GPU mechanism for obtaining it? The CUDA C Programming Guide doesn't seem to discuss this issue (in or outside of the section on shared memory).
TL;DR: Yes. Use the function below.
It is possible: That information is available to the kernel code in special registers: %dynamic_smem_size and %total_smem_size.
Typically, when we write kernel code, we don't need to be aware of specific registers (special or otherwise) - we write C/C++ code. Even when we do use these registers, the CUDA compiler hides this from us through functions or structures which hold their values. For example, when we use the value threadIdx.x, we are actually accessing the special register %tid.x, which is set differently for every thread in the block. You can see these registers "in action" when you look at compiled PTX code. ArrayFire have written a nice blog post with some worked examples: Demystifying PTX code.
But if the CUDA compiler "hides" register use from us, how can we go behind that curtain and actually insist on using them, accessing them with those %-prefixed names? Well, here's how:
__forceinline__ __device__ unsigned dynamic_smem_size()
{
unsigned ret;
asm volatile ("mov.u32 %0, %%dynamic_smem_size;" : "=r"(ret));
return ret;
}
and a similar function for %total_smem_size (a sketch of that one is given at the end of this answer). This function makes the compiler emit an explicit PTX instruction, just like asm can be used in host code to emit CPU assembly instructions directly. This function should always be inlined, so when you assign
x = dynamic_smem_size();
you actually just assign the value of the special register to x.
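For reference, the companion accessor for %total_smem_size (the static plus dynamic shared memory of the running kernel) follows the same pattern, with the same inline-PTX caveats:
__forceinline__ __device__ unsigned total_smem_size()
{
    unsigned ret;
    asm volatile ("mov.u32 %0, %%total_smem_size;" : "=r"(ret));
    return ret;
}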

Allocating global variables on multiple GPUs

I have a code working on a single GPU. In that code, I used
__device__ uint32_t aaa;
This line at the beginning of the code declares a global variable on the only device involved.
Now I want to use multiple devices (two or more), but I don't know how to allocate global variables in this case.
I think I should use cudaSetDevice() but I wonder where I should call this function.
When you create a variable like this:
__device__ int myval;
It is created at global scope. An allocation for it is made in the GPU memory of each device that is present when your application is launched.
In host code (when using functions such as cudaMemcpyFromSymbol()), you will be accessing whichever copy corresponds to your most recent cudaSetDevice() call. In device code, you will be accessing whichever copy corresponds to the device that the code is executing on.
The __device__ declaration is at global scope (and statically allocated) in your program. Variables at global scope are set up without the help of any runtime activity, so there is no opportunity to specify which devices the variable should be instantiated on; CUDA therefore instantiates such variables on every device present. Dynamically allocated device memory, on the other hand, is allocated with runtime calls such as cudaMalloc and cudaMemcpy, and those calls can be preceded by a cudaSetDevice call in a multi-GPU system. The CUDA runtime then manages those allocations on a per-device basis, which is consistent with the behavior of most CUDA runtime API calls: they operate on the most recently selected device.
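A sketch of how this plays out with the statically declared variable from the question: select each device in turn and use the symbol APIs to touch that device's copy (the initial values here are purely illustrative):
#include <cstdint>

__device__ uint32_t aaa;                          // one instance exists on every device

int main()
{
    int ndevices = 0;
    cudaGetDeviceCount(&ndevices);

    for (int dev = 0; dev < ndevices; ++dev) {
        cudaSetDevice(dev);                       // selects which copy of aaa we operate on
        uint32_t init = 42u + dev;                // a per-device value, just for illustration
        cudaMemcpyToSymbol(aaa, &init, sizeof(init));
        // kernels launched here run on device 'dev' and see that device's copy of aaa
    }
    return 0;
}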