Free shared memory in CUDA - cuda

Is there any application level API available to free shared memory allocated by CTA in CUDA? I want to reuse my CTA for another task and before starting that task I should clean memory used by previous task.

Shared memory is statically allocated at kernel launch time. You can optionally specify an unsized shared allocation in the kernel:
__global__ void MyKernel()
{
__shared__ int fixedShared;
extern __shared__ int extraShared[];
...
}
The third kernel launch parameter then specifies how much shared memory corresponds to that unsized allocation.
MyKernel<<<blocks, threads, numInts*sizeof(int)>>>( ... );
The total amount of shared memory allocated for the kernel launch is the sum of the amount declared in the kernel, plus the shared memory kernel parameter, plus alignment overhead. You cannot "free" it - it stays allocated for the duration of the kernel launch.
For kernels that go through multiple phases of execution and need to use the shared memory for different purposes, what you can do is reuse the memory with shared memory pointers - use pointer arithmetic on the unsized declaration.
Something like:
__global__ void MyKernel()
{
__shared__ int fixedShared;
extern __shared__ int extraShared[];
...
__syncthreads();
char *nowINeedChars = (char *) extraShared;
...
}
I don't know of any SDK samples that use this idiom, though the threadFenceReduction sample declares a __shared__ bool and also uses shared memory to hold the partial sums of the reduction.

Related

CUDA shared memory under the hood questions

I have several questions regarding to CUDA shared memory.
First, as mentioned in this post, shared memory may declare in two different ways:
Either dynamically shared memory allocated, like the following
// Lunch the kernel
dynamicReverse<<<1, n, n*sizeof(int)>>>(d_d, n);
This may use inside a kernel as mention:
extern __shared__ int s[];
Or static shared memory, which can use in kernel call like the following:
__shared__ int s[64];
Both are use for different reasons, however which one is better and why ?
Second, I'm running a multi blocks, 256 threads per block kernel. I'm using static shared memory in global and device kernels, both of them uses shared memory. An example is given:
__global__ void startKernel(float* p_d_array)
{
__shared double matA[3*3];
float a1 =0 ;
float a2 = 0;
float a3 = 0;
float b = p_d_array[threadidx.x];
a1 += reduce( b, threadidx.x);
a2 += reduce( b, threadidx.x);
a3 += reduce( b, threadidx.x);
// continue...
}
__device__ reduce ( float data , unsigned int tid)
{
__shared__ float data[256];
// do reduce ...
}
I'd like to know how the shared memory is allocated in such case. I presume each block receive its own shared memory.
What's happening when block # 0 goes into reduce function?
Does the shared memory is allocated in advance to the function call?
I call three different reduce device function, in such case, theoretically in block # 0 , threads # [0,127] may still execute ("delayed due hard word") on the first reduce call, while threads # [128,255] may operate on the second reduce call. In this case, I'd like to know if both reduce function are using the same shared memory?
Even though if they are called from two different function calls ?
On the other hand, Is that possible that a single block may allocated 3*256*sizeof(float) shared memory for both functions calls? That's seems superfluous in CUDA manners, but I still want to know how CUDA operates in such case.
Third, is that possible to gain higher performance in shared memory due to compiler optimization using
const float* p_shared ;
or restrict keyword after the data assignment section?
AFAIR, there is little difference whether you request shared memory "dynamically" or "statically" - in either case it's just a kernel launch parameter be it set by your code or by code generated by the compiler.
Re: 2nd, compiler will sum the shared memory requirement from the kernel function and functions called by kernel.

CUDA static shared memory deallocation

Is there any way to deallocate shared memory previosuly allocated inside the same CUDA kernel?
For example, inside the kernel at one point I have defined
__shared__ unsigned char flag;
__shared__ unsigned int values [ BLOCK_DIM ];
Later on inside the code, I need to define an array that with considering previously defined shared memory exceeds the shared memory limit set for a block. How can I do that without dirty works of re-using previously defined shared memory? Or NVCC is smart enough to recognize dependencies along the kernel trace and deallocates it whenever done using shared variables?
My device is GeForce GTX 780 (CC=3.5).
In C/C++, it is not possible to deallocate statically defined arrays.
You may wish to dynamically allocate the amount of shared memory needed for the worst case as follows. Add
extern __shared__ float foo[];
within the kernel function and launch your kernel function as
myKernel<<<numBlocks, numThreads, sh_mem_size>>> (...);
Remember that you can manage multiple arrays by playing with pointers. Have a look at the CUDA C Programming Guide for further details. For example, quoting the Guide
extern __shared__ float array[];
__device__ void func() // __device__ or __global__ function
{
short* array0 = (short*)array;
float* array1 = (float*)&array0[128];
int* array2 = (int*)&array1[64];
}
By the same concept, you can dynamically change the size of the arrays you are dealing with.

CUDA context lifetime

In my application I have some part of the code that works as follows
main.cpp
int main()
{
//First dimension usually small (1-10)
//Second dimension (100 - 1500)
//Third dimension (10000 - 1000000)
vector<vector<vector<double>>> someInfo;
Object someObject(...); //Host class
for (int i = 0; i < N; i++)
someObject.functionA(&(someInfo[i]));
}
Object.cpp
void SomeObject::functionB(vector<vector<double>> *someInfo)
{
#define GPU 1
#if GPU == 1
//GPU COMPUTING
computeOnGPU(someInfo, aConstValue, aSecondConstValue);
#else
//CPU COMPUTING
#endif
}
Object.cu
extern "C" void computeOnGPU(vector<vector<double>> *someInfo, int aConstValue, int aSecondConstValue)
{
//Copy values to constant memory
//Allocate memory on GPU
//Copy data to GPU global memory
//Launch Kernel
//Copy data back to CPU
//Free memory
}
So as (I hope) you can see in the code, the function that prepares the GPU is called many times depending on the value of the first dimension.
All the values that I send to constant memory always remain the same and the sizes of the pointers allocated in global memory are always the same (the data is the only one changing).
This is the actual workflow in my code but I'm not getting any speedup when using GPU, I mean the kernel does execute faster but the memory transfers became my problem (as reported by nvprof).
So I was wondering where in my app the CUDA context starts and finishes to see if there is a way to do only once the copies to constant memory and memory allocations.
Normally, the cuda context begins with the first CUDA call in your application, and ends when the application terminates.
You should be able to do what you have in mind, which is to do the allocations only once (at the beginning of your app) and the corresponding free operations only once (at the end of your app) and populate __constant__ memory only once, before it is used the first time.
It's not necessary to allocate and free the data structures in GPU memory repetetively, if they are not changing in size.

Is it worthwhile to pass kernel parameters via shared memory?

Suppose that we have an array int * data, each thread will access one element of this array. Since this array will be shared among all threads it will be saved inside the global memory.
Let's create a test kernel:
__global__ void test(int *data, int a, int b, int c){ ... }
I know for sure that the data array will be in global memory because I allocated memory for this array using cudaMalloc. Now as for the other variables, I've seen some examples that pass an integer without allocating memory, immediately to the kernel function. In my case such variables are a b and c.
If I'm not mistaken, even though we do not call directly cudaMalloc to allocate 4 bytes for each three integers, CUDA will automatically do it for us, so in the end the variables a b and c will be allocated in the global memory.
Now these variables, are only auxiliary, the threads only read them and nothing else.
My question is, wouldn't it be better to transfer these variables to the shared memory?
I imagine that if we had for example 10 blocks with 1024 threads, we would need 10*3 = 30 reads of 4 bytes in order to store the numbers in the shared memory of each block.
Without shared memory and if each thread has to read all these three variables once, the total amount of global memory reads will be 1024*10*3 = 30720 which is very inefficient.
Now here is the problem, I'm somewhat new to CUDA and I'm not sure if it's possible to transfer the memory for variables a b and c to the shared memory of each block without having each thread reading these variables from the global memory and loading them to the shared memory, so in the end the total amount of global memory reads would be 1024*10*3 = 30720 and not 10*3 = 30.
On the following website there is this example:
__global__ void staticReverse(int *d, int n)
{
__shared__ int s[64];
int t = threadIdx.x;
int tr = n-t-1;
s[t] = d[t];
__syncthreads();
d[t] = s[tr];
}
Here each thread loads different data inside the shared variable s. So each thread, according to its index, loads the specified data inside the shared memory.
In my case, I want to load only variables a b and c to the shared memory. These variables are always the same, they don't change, so they don't have anything to do with the threads themselves, they are auxiliary and are being used by each thread to run some algorithm.
How should I approach this problem? Is it possible to achieve this by only doing total_amount_of_blocks*3 global memory reads?
The GPU runtime already does this optimally without you needing to do anything (and your assumption about how argument passing works in CUDA is incorrect). This is presently what happens:
In compute capability 1.0/1.1/1.2/1.3 devices, kernel arguments are passed by the runtime in shared memory.
In compute capability 2.x/3.x/4.x/5.x/6.x devices, kernel arguments are passed by the runtime in a reserved constant memory bank (which has a dedicated cache with broadcast).
So in your hypothetical kernel
__global__ void test(int *data, int a, int b, int c){ ... }
data, a, b, and c are all passed by value to each block in either shared memory or constant memory (depending on GPU architecture) automatically. There is no advantage in doing what you propose.

Allocate shared variables in CUDA

How to allocate shared variables in CUDA? I have a kernel where data needs to be shared across threads belonging to a particular block. I need two shared variables named sid and eid. I use it like this:
extern __shared__ int sid, eid
but it is giving me an error that __shared__ variables cannot have external linkage.
There are two ways to allocate shared memory : static and dynamic
1、static
__shared__ int Var1[10]
2、dynamic : should add "extern" keyword
extern __shared__ int Var1[]
If you use dynamic way to allocate shared memory, you should set the shared memory size when you call the function. for example:
testKernel <<< grid, threads, size>>>(...)
the third parameter is the size of shared memory. In this way, all the shared memories start from the same address. If you want to define several shared memory variables, you should write code like following.
__global__ void func(...)
{
extern __shared__ char array[];
short * array0 = (short*)array;
float * array1 = (float*)(&array0[128]);
}