Dynamic stack allocation on the device with CUDA

Title says it all.
Is there anything similar to alloca() in CUDA, but for the device side? I need to allocate small arrays (n x n and n x 1, with n <= 10), where n is a runtime variable.
thanks!

You will have to use malloc or new inside the kernel. Be careful though, as this will allocate memory in the global memory space.
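A minimal sketch of that approach, assuming each thread works on its own n x n system (the kernel name is made up): the allocation comes from the device heap, which lives in global memory, and must be freed explicitly.

__global__ void solve_small_system(int n)
{
    // n is a runtime value (n <= 10 in the question)
    float *A = (float *)malloc(n * n * sizeof(float));  // n x n matrix
    float *b = (float *)malloc(n * sizeof(float));      // n x 1 vector
    if (A == NULL || b == NULL) {
        // In-kernel malloc can fail if the device heap is exhausted
        if (A) free(A);
        if (b) free(b);
        return;
    }

    // ... fill and use A and b here ...

    free(b);
    free(A);
}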

Related

CUDA equivalent of OpenCL CL_MEM_USE_HOST_PTR

I'd like to know if there is something similar to CL_MEM_USE_HOST_PTR but for CUDA. Reading the CUDA docs, it seems the only "zero-copy" functionality is implemented through the API function cudaHostAlloc. The problem is that CUDA allocates the memory and there is no way for me to divert it to some preallocated CPU memory area, something that is straightforward with OpenCL using the specified flag for clCreateBuffer.
Maybe I am wrong, but it looks like CUDA doesn't implement such a thing at all.
The problem is that CUDA allocates the memory and there is no way for me to divert it to some preallocated CPU memory area.
The API call that does this in CUDA is cudaHostRegister().
It takes a pointer returned by an ordinary host allocator such as malloc() or new, and converts the memory region into pinned memory. (Which would be suitable for "zero-copy" usage, among other things.)
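A minimal sketch of that flow (the buffer size and flags are illustrative assumptions): an ordinary malloc'd buffer is pinned with cudaHostRegister() and then mapped into the device address space.

#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 1 << 20;
    float *host_buf = (float *)malloc(bytes);   // ordinary host allocation

    // Pin the existing allocation; cudaHostRegisterMapped makes it mappable
    // into the device address space for zero-copy access. (On older, non-UVA
    // setups you may also need cudaSetDeviceFlags(cudaDeviceMapHost) first.)
    cudaHostRegister(host_buf, bytes, cudaHostRegisterMapped);

    float *dev_ptr = NULL;
    cudaHostGetDevicePointer((void **)&dev_ptr, host_buf, 0);
    // dev_ptr can now be passed to kernels; it aliases host_buf.

    cudaHostUnregister(host_buf);
    free(host_buf);
    return 0;
}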

What's the better practice for temporary device pointers in cuda, reuse a fixed device pointer or create and free device pointers?

In CUDA kernel functions there is no automatic garbage collection. What's the better practice for temporary device pointers in CUDA: reuse a fixed device pointer, or create and free device pointers?
For example, to write a CUDA kernel function for the sum of squared errors between two vectors, it's convenient to have a temporary device pointer for storing the difference of the two vectors and then sum the squares of its elements. One option is to allocate a temporary device pointer and then free it on every function call; another option is to keep a constantly reused temporary device pointer.
What's the better practice between these two options?
If you can use cudaMalloc and cudaFree and avoid repeated allocations, you should avoid dynamic memory allocation within the kernel, as it has an additional performance cost and is limited in size by the device heap:
The following API functions get and set the heap size:
cudaDeviceGetLimit(size_t* size, cudaLimitMallocHeapSize)
cudaDeviceSetLimit(cudaLimitMallocHeapSize, size_t size)
The heap size granted will be at least size bytes. cuCtxGetLimit() and cudaDeviceGetLimit() return the currently requested heap size.
The actual memory allocation for the heap occurs when a module is loaded into the context, either explicitly via the CUDA driver API (see Module), or implicitly via the CUDA runtime API (see CUDA C Runtime).
See Dynamic global memory allocation in CUDA Documentation.
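As an illustration of the first option, here is a rough sketch (kernel and function names are made up): one temporary device buffer is allocated with cudaMalloc, reused across every call, and freed once at the end, instead of being allocated per call or inside the kernel.

__global__ void squared_diff(const float *a, const float *b, float *tmp, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float d = a[i] - b[i];
        tmp[i] = d * d;   // temporary result, later reduced to the SSE
    }
}

void sse_many_times(const float *d_a, const float *d_b, int n, int iterations)
{
    float *d_tmp = NULL;
    cudaMalloc(&d_tmp, n * sizeof(float));   // allocated once, reused

    for (int it = 0; it < iterations; ++it) {
        squared_diff<<<(n + 255) / 256, 256>>>(d_a, d_b, d_tmp, n);
        // ... reduce d_tmp to the sum of squared errors ...
    }

    cudaFree(d_tmp);   // freed once
}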

Variable in the shared and managed memory in cuda

With the unified memory feature now in CUDA, variables can live in managed memory, which makes the code a little simpler, whereas shared memory is shared between threads in a thread block. See section 3.2 in CUDA Fortran.
My question is: can a variable be in both the shared and the managed memory? It would be managed on the host but shared on the device. What type of behaviour may be expected of this type of variable?
I am using CUDA Fortran. I ask this question because declaring the variable as managed makes it easier for me to code, whereas making it shared on the device makes it faster than global device memory.
I could not find anything that gave me a definitive answer in the documentation.
My question is can a variable be in both the shared and managed memory?
No, it's not possible.
Managed memory is created with either a static allocator (__managed__ in CUDA C, or the managed attribute in CUDA Fortran) or a dynamic allocator (cudaMallocManaged in CUDA C, or the managed, allocatable attributes in CUDA Fortran).
Both of these are associated with the logical global memory space in CUDA. The __shared__ (or shared) memory space is a separate logical space, and must be used (allocated, accessed) explicitly, independent of any global space usage.
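A minimal sketch in CUDA C (the question is CUDA Fortran, but the memory spaces behave the same way; the array size and kernel are assumptions): the managed variable lives in the global space, and shared-memory speed is obtained by staging it into a separate __shared__ array.

__managed__ float data[256];   // managed: accessible from host and device code

__global__ void use_staged_copy()   // launched with one block of 256 threads
{
    __shared__ float tile[256];     // shared: a separate, per-block space
    int i = threadIdx.x;
    tile[i] = data[i];              // explicit copy from managed (global) to shared
    __syncthreads();

    // ... work on tile[] at shared-memory speed ...

    data[i] = tile[i];              // write results back if needed
}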

How to use shared memory between kernel call of CUDA?

I want to use shared memory between successive calls of one kernel.
Can I use shared memory between kernel calls?
No, you can't. Shared memory has thread-block lifetime: a variable stored in it is accessible by all the threads of one thread block, and only for the duration of one __global__ function invocation.
Try page-locked (mapped) memory instead, but keep in mind that access to it is much slower than device memory:
cudaHostAlloc(void **ptr, size_t size, cudaHostAllocMapped);
then pass the corresponding device pointer to the kernel code.
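A rough sketch of that suggestion (sizes and the kernel are assumptions): the mapped, page-locked buffer outlives any single launch, so successive kernel calls see the same data, at the cost of accesses going over the PCIe bus.

#include <cuda_runtime.h>

__global__ void touch(float *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;   // reads/writes go over PCIe (slow)
}

int main()
{
    const int n = 1024;
    float *h_ptr = NULL, *d_ptr = NULL;

    cudaHostAlloc((void **)&h_ptr, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_ptr, h_ptr, 0);

    touch<<<(n + 255) / 256, 256>>>(d_ptr, n);   // first kernel call
    touch<<<(n + 255) / 256, 256>>>(d_ptr, n);   // data persists between calls
    cudaDeviceSynchronize();

    cudaFreeHost(h_ptr);
    return 0;
}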
Previously you could do this in a non-standard way, where you would have a unique id for each shared memory block and the next kernel would check the id and carry out the required processing on that shared memory block. This was hard to implement, as you needed to ensure full occupancy for each kernel and deal with various corner cases. In addition, without formal support you could not rely on compatibility across compute devices and CUDA versions.

Dynamic memory allocation in __global__ functions

I have a CC 1.1 card, and my program requires me to dynamically allocate arrays in __global__ or __device__ functions.
These arrays will be created for every thread during execution.
malloc throws an error, and searching the web tells me that using malloc is illegal for compute capability less than 2.0.
I want to ask: is there any workaround?
Thanks
I would suggest you use fixed-size memory:
__global__ void my_kernel(/* ... */)
{
    __shared__ float memory[BLOCK_SIZE];   // BLOCK_SIZE must be a compile-time constant
}
Dynamic allocation on the GPU is rarely needed and will most likely introduce a performance bottleneck. Especially with compute capability 1.1, you will need to tweak the alignment of shared memory accesses to get the best performance and avoid intra-warp memory contention (bank conflicts).
For CC1.1 devices the only workaround is to allocate enough global memory from host with cudaMalloc and then divide it between threads.
In most cases, just pre-allocating memory from the host works pretty well, and I've never encountered a task where one had to use kernel malloc (sometimes the idea seemed good, but it quickly turned out not to be worth breaking compatibility with older devices; I also have suspicions about its performance, but I've never run any benchmarks).
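A minimal sketch of that workaround (names, sizes, and launch configuration are assumptions): one large cudaMalloc from the host, sliced into a fixed-size piece per thread inside the kernel, so no in-kernel malloc is needed.

#define MAX_N 10   // assumed upper bound on the per-thread array size

__global__ void kernel(float *scratch, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Each thread gets its own MAX_N x MAX_N slice of the pre-allocated pool.
    float *my_array = scratch + (size_t)tid * MAX_N * MAX_N;

    for (int i = 0; i < n * n; ++i)
        my_array[i] = 0.0f;   // ... per-thread work on my_array ...
}

int main()
{
    const int threads = 256, blocks = 64;
    float *d_scratch = NULL;
    cudaMalloc(&d_scratch,
               (size_t)threads * blocks * MAX_N * MAX_N * sizeof(float));

    kernel<<<blocks, threads>>>(d_scratch, 8);
    cudaDeviceSynchronize();

    cudaFree(d_scratch);
    return 0;
}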