Why doesn't local memory work with dynamic parallelism? [duplicate] - cuda

I am wondering why they have the same memory address when, if I remember correctly, each thread gets its own copy of a variable created in this way:
__global__ void
Matrix_Multiplication_Shared(
    const int* const Matrix_A,
    const int* const Matrix_B,
    int* const Matrix_C)
{
    const int sum_value = threadIdx.x;
    printf("%p \n", &sum_value);
}
Output: every thread prints the same address.
I am considering the case of a single thread block with, for example, 2 or more threads.

NVIDIA GPUs have multiple address spaces.
The primary virtual address space used by pointers is called the generic address space. Inside the generic address space are windows for local memory and shared memory. The rest of the generic address space is the global address space. PTX and the GPU instruction set support additional instructions for 0-based access to the local and shared memory address spaces.
Some automatic variables and stack memory are in the local memory address space. The primary difference between global memory and local memory is that local memory is organized such that consecutive 32-bit words are accessed by consecutive thread IDs. If each thread reads or writes the same local memory offset, then the memory access is fully coalesced.
In PTX local memory is accessed via ld.local and st.local.
In GPU SASS the instructions have two forms:
LDL, STL are direct access to local memory given as 0-based offset
LD, ST can be used for local memory access through the generic local memory window.
When you take the address of the variable, the generic address space address is returned. Each thread sees the same offset from the generic local memory window base pointer. The load/store unit will convert the 0-based offset into a unique per-thread global address.
For more information see:
CUDA Programming Guide section on Local Memory
PTX ISA section on Generic Addressing. Details on local memory are scattered throughout the manual.
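For completeness, here is a minimal, hedged sketch (not the original poster's code) that shows the effect: every thread prints the same generic address for its automatic variable, the __isLocal() address space predicate confirms the pointer falls inside the local memory window, yet each thread reads back its own per-thread value.

#include <cstdio>

__global__ void Inspect_Local_Address()
{
    // One copy per thread; taking its address forces it into (per-thread) local memory.
    int my_value = threadIdx.x;
    printf("thread %u: address %p, isLocal %u, value %d\n",
           threadIdx.x, (void*)&my_value, __isLocal(&my_value), my_value);
}

int main()
{
    Inspect_Local_Address<<<1, 4>>>();
    cudaDeviceSynchronize();
    return 0;
}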

Related

Is there a guideline about register and local memory in CUDA programming? [duplicate]

This question already has answers here: In a CUDA kernel, how do I store an array in "local thread memory"?
The number of registers is limited on the GPU, e.g. on an A100 each thread cannot use more than 255 registers.
But in my tests, even when staying under 255, the compiler uses local memory instead of registers. Is there a more detailed guideline about how to keep my data in registers, and when it ends up in local memory?
I tried defining a local array in my kernel. It looks like the array length affects the behavior of the compiler.
template<int len>
__global__ void test(){
    // ...
    float arr[len];
    // ...
}
Local arrays are placed in local memory if they are not accessed exclusively with compile-time constant indices.
This is described in the Programming Guide Section 5.3.2 Paragraph Local Memory. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses
Local memory accesses only occur for some automatic variables as mentioned in Variable Memory Space Specifiers. Automatic variables that the compiler is likely to place in local memory are:
Arrays for which it cannot determine that they are indexed with constant quantities,
Large structures or arrays that would consume too much register space,
Any variable if the kernel uses more registers than available (this is also known as register spilling).
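As a hedged illustration of these rules (kernel names and sizes are mine, not from the guide): in the first kernel below the loops can be fully unrolled, so the array is only ever indexed with compile-time constants and the compiler can keep it in registers; in the second the index depends on runtime data, so the array is forced into local memory. Compiling with nvcc -Xptxas -v (or --resource-usage) reports register counts, stack frame size and spill loads/stores, which is the easiest way to see which case you got.

template <int len>
__global__ void likely_registers(const int* in, int* out)
{
    float arr[len];                        // indexed only by unrolled (constant) loop indices
#pragma unroll
    for (int i = 0; i < len; ++i)
        arr[i] = in[threadIdx.x] * (i + 1);
    float sum = 0.0f;
#pragma unroll
    for (int i = 0; i < len; ++i)
        sum += arr[i];
    out[threadIdx.x] = (int)sum;
}

template <int len>
__global__ void likely_local_memory(const int* in, int* out)
{
    float arr[len];                        // runtime index below -> usually ends up in local memory
    for (int i = 0; i < len; ++i)
        arr[i] = (float)i;
    // Assumes non-negative input values; the data-dependent index defeats register promotion.
    out[threadIdx.x] = (int)arr[in[threadIdx.x] % len];
}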

NVIDIA __constant memory: how to populate constant memory from host in both OpenCL and CUDA?

I have a buffer (array) on the host that should reside in the constant memory region of the device (in this case, an NVIDIA GPU).
So, I have two questions:
How can I allocate a chunk of constant memory? Given the fact that I am tracing the available constant memory on the device and I know, for a fact, that we have that amount of memory available to us (at this time)
How can I initialize (populate) those arrays from values that are computed at the run time on the host?
I searched the web for this but there is no concise document documenting this. I would appreciate it if provided examples would be in both OpenCL and CUDA. The example for OpenCL is more important to me than CUDA.
How can I allocate a chunk of constant memory? Given the fact that I am tracing the available constant memory on the device and I know, for a fact, that we have that amount of memory available to us (at this time)
In CUDA, you can't. There is no runtime allocation of constant memory, only static definition of memory via the __constant__ specifier, which gets mapped to constant memory pages at assembly. You could generate some code containing such a static declaration at runtime and compile it via nvrtc, but that seems like a lot of effort for something you know can only be sized up to 64KB. It seems much simpler (to me at least) to just statically declare a 64KB constant buffer and use it at runtime as you see fit.
How can I initialize (populate) those arrays from values that are computed at the runtime on the host?
As noted in comments, see here. The cudaMemcpyToSymbol API was created for this purpose and it works just like standard memcpy.
Functionally, there is no difference between __constant in OpenCL and __constant__ in CUDA. The same limitations apply: static definition at compile time (which is runtime in the standard OpenCL execution model) and the 64KB limit.
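A minimal CUDA sketch of this pattern (the array size and names are illustrative, not part of the answer above): declare a fixed-size __constant__ buffer statically and populate it at run time with cudaMemcpyToSymbol from values computed on the host.

#include <cuda_runtime.h>

__constant__ float coefficients[1024];    // hypothetical statically sized constant buffer

__global__ void use_coefficients(float* out)
{
    out[threadIdx.x] = 2.0f * coefficients[threadIdx.x];
}

int main()
{
    float host_values[1024];
    for (int i = 0; i < 1024; ++i)
        host_values[i] = 0.5f * i;        // computed on the host at run time

    // Populate the constant memory region from the host.
    cudaMemcpyToSymbol(coefficients, host_values, sizeof(host_values));

    float* d_out;
    cudaMalloc(&d_out, 256 * sizeof(float));
    use_coefficients<<<1, 256>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}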
For CUDA, I use the driver API and NVRTC and create the kernel string with a global constant array like this:
auto kernel = R"(
..
__constant__ ##Type## buffer[##SIZE##]={
##elm##
};
..
__global__ void test(int * input)
{ }
)";
then replace the ##-pattern words with the size and element values at run time and compile, producing something like this:
__constant__ int buffer[16384]={ 1,2,3,4, ....., 16384 };
So it is run time for the host and compile time for the device. The downside is that the kernel string gets big and less readable, and linking against other compilation units has to be done explicitly (as if you were building a separate C++ project). But for simple calculations that use only your own implementations (no host definitions used directly), it behaves the same as the runtime API.
Since large strings take extra parsing time, you can cache the PTX intermediate data and also cache the binary generated from the PTX. Then you can check whether the kernel string has changed and needs to be recompiled.
Are you sure __constant__ alone is worth the effort? Do you have benchmark results showing that it actually improves performance? (Premature optimization is the root of all evil.) Perhaps your algorithm works with register tiling and the source of the data does not matter?
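Below is a hedged sketch of that workflow (error checking omitted, names and values are illustrative): substitute the ##-patterns in the kernel string at run time, compile with NVRTC, and load the resulting PTX through the driver API. extern "C" is added so the kernel name is not mangled and can be found with cuModuleGetFunction.

#include <cuda.h>
#include <nvrtc.h>
#include <string>
#include <vector>

// Replace every occurrence of key in src with value.
static std::string render(std::string src, const std::string& key, const std::string& value)
{
    for (size_t pos = src.find(key); pos != std::string::npos; pos = src.find(key))
        src.replace(pos, key.size(), value);
    return src;
}

int main()
{
    std::string src = R"(
        __constant__ ##Type## buffer[##SIZE##] = { ##elm## };
        extern "C" __global__ void test(int* input)
        {
            input[threadIdx.x] = buffer[threadIdx.x];
        }
    )";
    src = render(src, "##Type##", "int");
    src = render(src, "##SIZE##", "4");
    src = render(src, "##elm##", "1, 2, 3, 4");   // values computed on the host at run time

    // Compile the string to PTX with NVRTC.
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src.c_str(), "test.cu", 0, nullptr, nullptr);
    nvrtcCompileProgram(prog, 0, nullptr);
    size_t ptx_size = 0;
    nvrtcGetPTXSize(prog, &ptx_size);
    std::vector<char> ptx(ptx_size);
    nvrtcGetPTX(prog, ptx.data());
    nvrtcDestroyProgram(&prog);

    // Load the PTX with the driver API and fetch the kernel.
    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext ctx;
    cuCtxCreate(&ctx, 0, dev);
    CUmodule mod;
    cuModuleLoadData(&mod, ptx.data());
    CUfunction fn;
    cuModuleGetFunction(&fn, mod, "test");
    // ... cuMemAlloc an input buffer, set up the argument list, and launch fn with cuLaunchKernel ...
    return 0;
}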
Disclaimer: I cannot help you with CUDA.
For OpenCL, constant memory is effectively treated as read-only global memory from the programmer/API point of view, or defined inline in kernel source.
Define constant variables, arrays, etc. in your kernel code, like constant float DCT_C4 = 0.707106781f;. Note that you can dynamically generate kernel code on the host at runtime to generate derived constant data if you wish.
Pass constant memory from host to kernel via a buffer object, just as you would for global memory. Simply specify a pointer parameter in the constant memory region in your kernel function's prototype:
kernel void mykernel(
    constant float* fixed_parameters,
    global const uint* dynamic_input_data,
    global uint* restrict output_data)
{
    // ...
}
and set the buffer on the host side with clSetKernelArg(), for example:
cl_mem fixed_parameter_buffer = clCreateBuffer(
    context,
    CL_MEM_READ_ONLY | CL_MEM_HOST_NO_ACCESS | CL_MEM_COPY_HOST_PTR,
    sizeof(cl_float) * num_fixed_parameters, fixed_parameter_data,
    NULL);
clSetKernelArg(mykernel, 0, sizeof(cl_mem), &fixed_parameter_buffer);
Make sure to take into account the value reported for CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE for the context being used!
It usually doesn't help to use constant memory buffers for streaming input data; such data is better stored in global buffers, even if they are marked read-only for the kernel. constant memory is most useful for data that are used by a large proportion of work-items. There is typically a fairly tight size limitation such as 64KiB on it - some implementations may "spill" to global memory if you try to exceed this, which will lose you any performance advantage you would gain from using constant memory.

Persistent buffers in CUDA

I have an application where I need to allocate and maintain a persistent buffer which can be used by successive launches of multiple kernels in CUDA. I will eventually need to copy the contents of this buffer back to the host.
I had the idea to declare a global scope device symbol which could be directly used in different kernels without being passed as an explicit kernel argument, something like
__device__ char* buffer;
but then I am uncertain how I should allocate memory and assign the address to this device pointer so that the memory has the persistent scope I require. So my question is really in two parts:
What is the lifetime of the various methods of allocating global memory?
How should I allocate memory and assign a value to the global scope pointer? Is it necessary to use device code malloc and run a setup kernel to do this, or can I use some combination of host side APIs to achieve this?
[Postscript: this question has been posted as a Q&A in response to this earlier SO question on a similar topic]
What is the lifetime of the various methods of allocating global memory?
All global memory allocations have a lifetime of the context in which they are allocated. This means that any global memory your application allocates is "persistent" by your definition, irrespective of whether you use host side APIs or device side allocation on the GPU runtime heap.
How should I allocate memory and assign a value to the global scope pointer? Is it necessary to use device code malloc and run a setup kernel to do this, or can I use some combination of host side APIs to achieve this?
Either method will work as you require, although host APIs are much simpler to use. There are also some important differences between the two approaches.
Memory allocations made with malloc or new in device code are allocated on a device runtime heap. This heap must be sized appropriately using the cudaDeviceSetLimit API before running malloc in device code, otherwise the call may fail. And the device heap is not accessible to host side memory management APIs, so you also require a copy kernel to transfer the memory contents to host API accessible memory before you can transfer the contents back to the host.
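For illustration, a hedged sketch of that device-side route (heap size and buffer size are assumed values): size the runtime heap first, then let a single thread of a setup kernel allocate and publish the pointer.

__device__ char* buffer;

__global__ void setup_buffer(size_t bytes)
{
    // One thread allocates from the device runtime heap and publishes the pointer.
    if (blockIdx.x == 0 && threadIdx.x == 0)
        buffer = static_cast<char*>(malloc(bytes));
}

// Host side, before launching setup_buffer:
//   cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);  // assumed 64MB heap
//   setup_buffer<<<1, 1>>>(800 * 600);
// Remember: this allocation lives on the device heap, so a copy kernel is needed
// before the contents can be moved back to the host with the host side APIs.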
The host API case, on the other hand, is extremely straightforward and has none of the limitations of device side malloc. A simple example would look something like:
#include <cuda_runtime.h>
#include <vector>

__device__ char* buffer;

int main()
{
    char* d_buffer;
    const size_t buffer_sz = 800 * 600 * sizeof(char);

    // Allocate memory
    cudaMalloc(&d_buffer, buffer_sz);

    // Zero memory and assign to global device symbol
    cudaMemset(d_buffer, 0, buffer_sz);
    cudaMemcpyToSymbol(buffer, &d_buffer, sizeof(char*));

    // Kernels go here using buffer

    // copy to host
    std::vector<char> results(800 * 600);
    cudaMemcpy(&results[0], d_buffer, buffer_sz, cudaMemcpyDeviceToHost);

    // buffer has lifespan until free'd here
    cudaFree(d_buffer);
    return 0;
}
[Standard disclaimer: code written in browser, not compiled or tested, use at own risk]
So basically you can achieve what you want with standard host side APIs: cudaMalloc, cudaMemcpyToSymbol, and cudaMemcpy. Nothing else is required.
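To make the "Kernels go here" step concrete, here is a hedged sketch (kernel names are mine, not from the answer) of kernels that use the global scope buffer symbol directly, without it being passed as an argument:

__global__ void fill_buffer(char value)
{
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < 800 * 600)
        buffer[idx] = value;
}

__global__ void increment_buffer()
{
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < 800 * 600)
        buffer[idx] = static_cast<char>(buffer[idx] + 1);
}

// Launched between cudaMemcpyToSymbol and the final cudaMemcpy, e.g.:
//   fill_buffer<<<(800 * 600 + 255) / 256, 256>>>(42);
//   increment_buffer<<<(800 * 600 + 255) / 256, 256>>>();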

Is local memory access coalesced?

Suppose, I declare a local variable in a CUDA kernel function for each thread:
float f = ...; // some calculations here
Suppose also that the declared variable was placed by the compiler in local memory (which, as far as I know, is the same as global memory except that it is visible to one thread only). My question is: will the access to f be coalesced when reading it?
I don't believe there is official documentation of how local memory (or the stack on Fermi) is laid out in memory, but I am pretty certain that per-multiprocessor allocations are accessed in a "striped" fashion so that non-diverging threads in the same warp will get coalesced access to local memory. On Fermi, local memory is also cached using the same L1/L2 mechanism as global memory.
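A hedged illustration of that claim (not authoritative, and the compiler's placement decisions may vary): when every active thread of a warp touches the same element index of its own per-thread local array in lockstep, the interleaved layout means the warp reads consecutive 32-bit words, i.e. the access is coalesced.

__global__ void uniform_local_index(const int* in, int* out, int n)
{
    float scratch[32];                     // likely in local memory: indexed by a runtime-bounded loop
    for (int i = 0; i < n && i < 32; ++i)
        scratch[i] = in[threadIdx.x] + i;  // same i across the warp each iteration -> coalesced
    float sum = 0.0f;
    for (int i = 0; i < n && i < 32; ++i)
        sum += scratch[i];
    out[threadIdx.x] = (int)sum;
}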
CUDA GPUs do not set aside dedicated on-chip storage for local variables; the compiler keeps local variables in registers where it can and spills the rest to local memory. Complex kernels with lots of variables reduce the number of threads that can run concurrently, a condition known as low occupancy.