How to use shared memory between kernel calls in CUDA? - cuda

I want to use shared memory across calls of one kernel.
Can I use shared memory between kernel calls?

No, you can't. Shared memory has thread-block lifetime: a variable stored in it is accessible only to the threads of the block that declared it, and only during a single __global__ function invocation.
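As a minimal sketch of that lifetime (the kernel and array names below are made up for illustration), a __shared__ array exists only while its block is running, so results must be written out to global memory before the block finishes:

__global__ void scale_via_shared(const float *in, float *out)
{
    __shared__ float tile[256];            // exists only while this block runs
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];             // stage data in shared memory
    __syncthreads();                       // assumes blockDim.x == 256
    out[i] = 2.0f * tile[threadIdx.x];     // persist results to global memory;
                                           // tile[] is gone once the block ends
}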

You could try page-locked (mapped) host memory instead, though it will be much slower than device memory:
cudaHostAlloc(void **ptr, size_t size, cudaHostAllocMapped);
then pass the pointer to the kernel code.
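A rough sketch of that approach, assuming a device that supports mapped memory (the kernel and variable names are invented for the example):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void accumulate(int *counter)
{
    atomicAdd(counter, 1);                   // both launches see the same buffer
}

int main()
{
    int *h_counter = NULL, *d_counter = NULL;

    cudaSetDeviceFlags(cudaDeviceMapHost);   // must precede context creation
    cudaHostAlloc((void **)&h_counter, sizeof(int), cudaHostAllocMapped);
    *h_counter = 0;
    cudaHostGetDevicePointer((void **)&d_counter, h_counter, 0);

    accumulate<<<1, 32>>>(d_counter);        // first kernel call
    accumulate<<<1, 32>>>(d_counter);        // second call sees earlier updates
    cudaDeviceSynchronize();

    printf("counter = %d\n", *h_counter);    // expect 64; note this memory is
                                             // reached over PCIe, so it is slow
    cudaFreeHost(h_counter);
    return 0;
}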

Previously you could do it in a non-standard way, where each shared memory block was tagged with a unique id and the next kernel checked that id and carried out the required processing on that block. This was hard to implement, since you needed to ensure full occupancy for each kernel and deal with various corner cases. In addition, without formal support you could not rely on compatibility across compute devices and CUDA versions.

Related

dynamic stack allocation on the device with cuda

Title says it all.
Is there anything similar to alloca() in CUDA, but for the device side? I need to allocate small arrays (n x n and n x 1, with n <= 10), where n is a runtime variable.
Thanks!
You will have to use malloc or new inside the kernel. Be careful though, as this will allocate memory in the global memory space.
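For example (a sketch; note that in-kernel malloc requires compute capability 2.0 or higher), each thread can allocate and free its own small array, which comes from the device heap in global memory:

__global__ void per_thread_scratch(int n)    // n <= 10 as in the question
{
    float *a = (float *)malloc(n * n * sizeof(float));
    if (a == NULL) return;                   // the heap can run out: always check

    for (int i = 0; i < n * n; ++i)
        a[i] = 0.0f;
    // ... use a[] as an n x n matrix ...

    free(a);                                 // device allocations must be freed
                                             // by device code, or they persist
}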

What's the better practice for temporary device pointers in cuda, reuse a fixed device pointer or create and free device pointers?

In CUDA kernel functions there's no automatic garbage collection. What's the better practice for temporary device pointers in CUDA: reuse a fixed device pointer, or create and free device pointers each time?
For example, to write a CUDA kernel function for the sum of squared errors between two vectors, it's convenient to have a temporary device pointer that stores the difference of the two vectors and then sum the squares of its elements. One option is to allocate a temporary device pointer and free it for every function call; another option is to keep a constantly reused temporary device pointer.
Which of these two options is the better practice?
If you can allocate once with cudaMalloc, reuse the buffer, and free it with cudaFree at the end, do that (a sketch of that reuse pattern follows the quoted documentation below). Avoid dynamic memory allocation within the kernel: it carries an additional performance cost and its size is limited by the device malloc heap:
The following API functions get and set the heap size:
cudaDeviceGetLimit(size_t* size, cudaLimitMallocHeapSize)
cudaDeviceSetLimit(cudaLimitMallocHeapSize, size_t size)
The heap size granted will be at least size bytes. cuCtxGetLimit() and cudaDeviceGetLimit() return the currently requested heap size.
The actual memory allocation for the heap occurs when a module is loaded into the context, either explicitly via the CUDA driver API (see Module), or implicitly via the CUDA runtime API (see CUDA C Runtime).
See Dynamic global memory allocation in CUDA Documentation.
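Concretely, here is a sketch of the "reuse a fixed device pointer" option for the sum-of-squared-errors example (the names are invented and error checking is omitted): allocate the temporary once, pass it to every call, and free it only when you are done.

#include <cuda_runtime.h>

__global__ void squared_diff(const float *a, const float *b, float *diff, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        diff[i] = (a[i] - b[i]) * (a[i] - b[i]);
}

struct SseWorkspace { float *d_diff; int n; };     // hypothetical reusable buffer

void sse_workspace_create(SseWorkspace *ws, int n)
{
    ws->n = n;
    cudaMalloc((void **)&ws->d_diff, n * sizeof(float)); // one allocation, reused
}

void sse_workspace_destroy(SseWorkspace *ws)
{
    cudaFree(ws->d_diff);                          // one free at shutdown
}

void sse_step(const float *d_a, const float *d_b, SseWorkspace *ws)
{
    int block = 256, grid = (ws->n + block - 1) / block;
    squared_diff<<<grid, block>>>(d_a, d_b, ws->d_diff, ws->n);
    // ... reduce ws->d_diff to a scalar with your reduction of choice ...
}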

cudaMemcpy & blocking

I'm confused by some comments I've seen about blocking and cudaMemcpy. It is my understanding that the Fermi HW can simultaneously execute kernels and do a cudaMemcpy.
I read that the library function cudaMemcpy() is a blocking function. Does this mean the function will block further execution until the copy has fully completed? Or does it mean the copy won't start until the previous kernels have finished?
e.g. Does this code provide the same blocking operation?
SomeCudaCall<<<25,34>>>(someData);
cudaThreadSynchronize();
vs
SomeCudaCall<<<25,34>>>(someParam);
cudaMemcpy(toHere, fromHere, sizeof(int), cudaMemcpyHostToDevice);
Your examples are equivalent. If you want asynchronous execution you can use streams or contexts and cudaMemcpyAsync, so that you can overlap execution with copy.
According to the NVIDIA Programming guide:
In order to facilitate concurrent execution between host and device, some function calls are asynchronous: Control is returned to the host thread before the device has completed the requested task. These are:
Kernel launches;
Memory copies between two addresses to the same device memory;
Memory copies from host to device of a memory block of 64 KB or less;
Memory copies performed by functions that are suffixed with Async;
Memory set function calls.
So as long as your transfer size is larger than 64KB your examples are equivalent.
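If you do want the copy to overlap with kernel execution, a sketch along these lines works (the stream names are invented; for the async copy to actually be asynchronous, the host buffer must be page-locked, e.g. allocated with cudaHostAlloc):

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

SomeCudaCall<<<25, 34, 0, s1>>>(someParam);      // kernel in one stream
cudaMemcpyAsync(toHere, fromHere, sizeof(int),
                cudaMemcpyHostToDevice, s2);     // copy in another stream

cudaStreamSynchronize(s1);                       // wait only where you need to
cudaStreamSynchronize(s2);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);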

Dynamic memory allocation in __global__ functions

I have a CC 1.1 card and my program requires me to dynamically allocate arrays in __global__ or __device__ functions.
These arrays will be created per thread during execution.
malloc throws an error, and searching the web tells me that using malloc is illegal for compute capability less than 2.0.
I want to ask: is there any workaround for this?
Thanks
I would suggest you use fixed-size shared memory:
__global__ void my_kernel(...) {
    __shared__ float memory[BLOCK_SIZE];
}
Dynamic allocation on the GPU is rarely needed and will most likely introduce a performance bottleneck. Especially with compute capability 1.1, you will need to tweak the alignment of your shared memory accesses to get the best performance and avoid intra-warp memory contention (bank conflicts).
For CC1.1 devices the only workaround is to allocate enough global memory from host with cudaMalloc and then divide it between threads.
In most cases, just pre-allocating memory from the host works pretty well, and I've never encountered a task where one had to use in-kernel malloc (though sometimes the idea seemed attractive, it quickly turned out it was not worth breaking compatibility with older devices; I also have suspicions about its performance, though I've never run any benchmarks).
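A sketch of that pre-allocation workaround for a CC 1.1 device (sizes and names are illustrative): give every thread a fixed slice of one large buffer allocated with cudaMalloc from the host.

#define N_MAX 10                                 // n <= 10 as in the question above

__global__ void use_scratch(float *scratch, int n, int total_threads)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= total_threads) return;

    // Each thread owns a contiguous N_MAX*N_MAX slice of the big buffer.
    float *my_mat = scratch + (size_t)tid * N_MAX * N_MAX;
    for (int i = 0; i < n * n; ++i)
        my_mat[i] = 0.0f;
    // ... work on my_mat as an n x n matrix ...
}

void launch(int total_threads, int n)
{
    float *d_scratch = NULL;
    size_t bytes = (size_t)total_threads * N_MAX * N_MAX * sizeof(float);
    cudaMalloc((void **)&d_scratch, bytes);      // allocated once from the host

    int block = 128, grid = (total_threads + block - 1) / block;
    use_scratch<<<grid, block>>>(d_scratch, n, total_threads);

    cudaFree(d_scratch);
}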

CUDA shared memory

I need to know something about CUDA shared memory. Let's say I assign 50 blocks with 10 threads per block in a G80 card. Each SM processor of a G80 can handle 8 blocks simultaneously. Assume that, after doing some calculations, the shared memory is fully occupied.
What will be the values in shared memory when the next 8 new blocks arrive? Will the previous values reside there? Or will the previous values be copied to global memory and the shared memory refreshed for next 8 blocks?
Regarding the variable type qualifiers, the reference states:
Variables in registers: per thread, live only for the duration of the kernel
Variables in global memory for a thread: live only for the duration of the kernel
__device__ __shared__ variables: in shared memory, per block, live only for the duration of the kernel
__device__ variables: in global memory, per grid, persist until the application exits
__device__ __constant__ variables: per grid, persist until the application exits
Thus, from this reference, the answer to your question is that the shared memory is refreshed for the next 8 blocks; the previous values are not preserved.
For kernel blocks, the execution order and SM assignment are effectively arbitrary. In that sense, even if an old value or address were preserved, it would be hard to keep track of it, and I doubt there is even a way to do that. Communication between blocks is done via off-chip memory, and the latency associated with off-chip memory is the performance killer that makes GPU programming tricky. On Fermi cards, blocks share some L2 cache, but one can't alter the behavior of these caches.
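The usual pattern, sketched below with invented names, is therefore to have each block write its partial result to global memory and let a second kernel launch pick those results up; the kernel boundary guarantees the writes from all blocks are visible.

__global__ void partial_sums(const float *in, float *block_sums, int n)
{
    __shared__ float tile[256];                  // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    if (threadIdx.x == 0) {                      // naive in-block reduction
        float s = 0.0f;
        for (int j = 0; j < blockDim.x; ++j) s += tile[j];
        block_sums[blockIdx.x] = s;              // hand off through off-chip memory
    }
}

__global__ void final_sum(const float *block_sums, float *out, int num_blocks)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {   // reads the previous launch's output
        float s = 0.0f;
        for (int j = 0; j < num_blocks; ++j) s += block_sums[j];
        *out = s;
    }
}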