Is there any way to deallocate shared memory previously allocated inside the same CUDA kernel?
For example, inside the kernel at one point I have defined
__shared__ unsigned char flag;
__shared__ unsigned int values [ BLOCK_DIM ];
Later on in the code, I need to define an array that, together with the previously defined shared memory, exceeds the shared memory limit set for a block. How can I do that without the dirty work of re-using the previously defined shared memory? Or is NVCC smart enough to recognize dependencies along the kernel trace and deallocate shared variables once they are no longer used?
My device is GeForce GTX 780 (CC=3.5).
In C/C++, it is not possible to deallocate statically defined arrays.
You may wish to dynamically allocate the amount of shared memory needed for the worst case as follows. Add
extern __shared__ float foo[];
within the kernel function and launch your kernel function as
myKernel<<<numBlocks, numThreads, sh_mem_size>>> (...);
Remember that you can manage multiple arrays by playing with pointers. Have a look at the CUDA C Programming Guide for further details. For example, quoting the Guide:
extern __shared__ float array[];
__device__ void func() // __device__ or __global__ function
{
    short* array0 = (short*)array;
    float* array1 = (float*)&array0[128];
    int* array2 = (int*)&array1[64];
}
By the same concept, you can dynamically change the size of the arrays you are dealing with.
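For instance, here is a minimal sketch (the kernel and variable names are just placeholders) of carving one dynamic shared allocation into two float arrays whose common size n is only known at launch time:
// Hypothetical sketch: partition a single dynamic shared allocation
// into two float arrays of runtime size n each.
extern __shared__ float sh[];

__global__ void myKernel(int n)
{
    float* bufA = sh;        // first n floats
    float* bufB = &sh[n];    // next n floats
    // ... use bufA and bufB ...
}

// Launched with enough dynamic shared memory for both arrays:
// myKernel<<<numBlocks, numThreads, 2 * n * sizeof(float)>>>(n);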
I am trying to get a better grasp of memory management in CUDA. Something is just now occurring to me as a major gap in my understanding: how do kernels access values that, as I understand it, should be in host memory?
When vectorAdd() is called, it runs the function on the device. But only the elements are stored in device memory; the length of the vectors is stored on the host. How is it that the kernel does not exit with an error when it accesses foo.length, something that should be on the host?
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    float *elements;
    int length;
} vector;

__global__ void vectorAdd(vector foo, vector bar) {
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < foo.length) { // this is the part that I do not understand
        foo.elements[idx] += bar.elements[idx];
    }
}

int main(void) {
    vector foo, bar;
    foo.length = bar.length = 50;
    cudaMalloc(&(foo.elements), sizeof(float)*50);
    cudaMalloc(&(bar.elements), sizeof(float)*50);
    // these vectors are empty, so adding is just a 0.0 += 0.0
    int blocks_per_grid = 10;
    int threads_per_block = 5;
    vectorAdd<<<blocks_per_grid, threads_per_block>>>(foo, bar);
    return 0;
}
In C and C++, a typical mechanism for making arguments available to the body of a function call is pass-by-value. The basic idea is that a separate copy of the arguments is made for use by the function.
CUDA claims compliance with C++ (subject to various limitations), and it therefore provides a mechanism for pass-by-value. On a kernel call, the CUDA compiler and runtime make copies of the arguments for use by the function (kernel). In the case of a kernel call, these copies are placed in a particular area of __constant__ memory, which is on the GPU and within the GPU memory space, and therefore "accessible" to device code.
So, in your example, the entire structures passed as the arguments for the parameters vector foo, vector bar are copied to GPU device memory (specifically, constant memory) by the CUDA runtime. The CUDA device code is structured by the compiler in such a way that it accesses these arguments as needed directly from constant memory.
Since those structures contain both the elements pointer and the scalar quantity length, both items are accessible in CUDA device code, and the compiler will structure references to them (e.g. foo.length) so as to retrieve the needed quantities from constant memory.
So the kernels are not accessing host memory in your example. The pass-by-value mechanism makes the quantities available to device code, in GPU constant memory.
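To see the same mechanism end to end, here is a minimal, self-contained sketch (kernel name and values are illustrative, not from your code) that follows your struct layout, actually copies data in and out, and relies on the by-value copy of the struct for the length check:
// Hypothetical sketch: a struct passed by value carries both the device
// pointer (elements) and the scalar (length) into the kernel arguments.
#include <cuda_runtime.h>
#include <stdio.h>

typedef struct {
    float *elements;
    int length;
} vector;

__global__ void scale(vector v, float factor)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < v.length)            // v.length read from the kernel-argument copy
        v.elements[idx] *= factor; // v.elements points to device memory
}

int main(void)
{
    vector v;
    v.length = 50;
    float host[50];
    for (int i = 0; i < 50; ++i) host[i] = (float)i;

    cudaMalloc((void**)&v.elements, 50 * sizeof(float));
    cudaMemcpy(v.elements, host, 50 * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<10, 5>>>(v, 2.0f);

    cudaMemcpy(host, v.elements, 50 * sizeof(float), cudaMemcpyDeviceToHost);
    printf("host[10] = %f\n", host[10]); // expect 20.0
    cudaFree(v.elements);
    return 0;
}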
I have several questions regarding CUDA shared memory.
First, as mentioned in this post, shared memory may be declared in two different ways.
Either dynamically allocated shared memory, like the following:
// Launch the kernel
dynamicReverse<<<1, n, n*sizeof(int)>>>(d_d, n);
This may be used inside a kernel as mentioned:
extern __shared__ int s[];
Or static shared memory, which can be used in a kernel like the following:
__shared__ int s[64];
Both are used for different reasons; however, which one is better and why?
Second, I'm running a multi-block kernel with 256 threads per block. I'm using static shared memory in the global and device functions; both of them use shared memory. An example is given:
__global__ void startKernel(float* p_d_array)
{
    __shared__ double matA[3*3];
    float a1 = 0;
    float a2 = 0;
    float a3 = 0;
    float b = p_d_array[threadIdx.x];
    a1 += reduce(b, threadIdx.x);
    a2 += reduce(b, threadIdx.x);
    a3 += reduce(b, threadIdx.x);
    // continue...
}

__device__ float reduce(float data, unsigned int tid)
{
    __shared__ float sdata[256];
    // do reduce ...
}
I'd like to know how the shared memory is allocated in such a case. I presume each block receives its own shared memory.
What happens when block #0 enters the reduce function?
Is the shared memory allocated in advance of the function call?
I call the reduce device function three times. In such a case, theoretically, in block #0, threads #[0,127] may still be executing the first reduce call (delayed due to hard work) while threads #[128,255] are already operating on the second reduce call. In this case, I'd like to know whether both reduce calls use the same shared memory, even though they are two different function calls?
On the other hand, is it possible that a single block allocates 3*256*sizeof(float) of shared memory for the three reduce calls? That seems superfluous in CUDA terms, but I still want to know how CUDA operates in such a case.
Third, is it possible to gain higher performance with shared memory thanks to compiler optimization, using
const float* p_shared;
or the restrict keyword after the data assignment section?
AFAIR, there is little difference whether you request shared memory "dynamically" or "statically": in either case it's just a kernel launch parameter, be it set by your code or by code generated by the compiler.
Re: the second question, the compiler will sum the shared memory requirements of the kernel function and of the functions called by the kernel.
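As a quick way to check this yourself, you can query the static shared memory footprint the compiler computed for the kernel with cudaFuncGetAttributes. A hedged sketch (function names and sizes are illustrative, loosely following your example):
// Hypothetical sketch: the static shared memory reported for the kernel
// includes __shared__ declarations in the __device__ functions it calls.
#include <cuda_runtime.h>
#include <stdio.h>

__device__ float reduceSketch(float x, unsigned int tid)
{
    __shared__ float scratch[256];   // 1024 bytes counted toward the kernel's total
    scratch[tid] = x;
    __syncthreads();
    // ... actual reduction omitted ...
    return scratch[0];
}

__global__ void startKernelSketch(float* p)
{
    __shared__ double matA[3*3];     // 72 bytes also counted toward the total
    if (threadIdx.x < 9) matA[threadIdx.x] = p[threadIdx.x];
    __syncthreads();
    p[threadIdx.x] = reduceSketch((float)matA[0], threadIdx.x);
}

int main(void)
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, startKernelSketch);
    printf("static shared memory per block: %zu bytes\n", attr.sharedSizeBytes);
    return 0;
}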
In my kernel function, I want two vectors in shared memory, both of length size (actually sizeof(float)*size bytes each).
Since it is not possible to allocate memory directly in the kernel function when the size is a variable, I had to allocate it dynamically, like:
myKernel<<<numBlocks, numThreads, 2*sizeof(float)*size>>> (...);
and, inside the kernel:
extern __shared__ float row[];
extern __shared__ float results[];
But, this doesn't work.
Instead of this, I made only one vector, extern __shared__ float rowresults[], containing all the data and using the 2*size memory allocated. So accesses to row are still the same, and accesses to results look like rowresults[size+previousIndex]. And this does work.
It is not a big problem because I get my expected results anyway, but is there any way to split my dynamically allocated shared memory into two (or more) different variables? Just for beauty.
The CUDA C Programming Guide section on __shared__ includes an example where you allocate multiple arrays from dynamically allocated shared memory:
extern __shared__ float array[];
__device__ void func() // __device__ or __global__ function
{
    short* array0 = (short*)array;
    float* array1 = (float*)&array0[128];
    int* array2 = (int*)&array1[64];
}
Since you're just getting a pointer to an element and making that a new array, I believe you could adapt that to use dynamic offsets instead of the static offsets they have in the example. They also note that the alignment has to be the same, which shouldn't be an issue in your case.
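Applied to your case, a sketch with a runtime offset (reusing your names row, results, and size) could look like this; since both halves are float, the alignment requirement mentioned above is trivially satisfied:
// Sketch with the questioner's names; size is the runtime length of each array.
extern __shared__ float rowresults[];

__global__ void myKernel(int size /*, ... */)
{
    float* row     = rowresults;          // first `size` floats
    float* results = &rowresults[size];   // second `size` floats
    // ... use row[i] and results[i] as before ...
}

// launched with 2 * size * sizeof(float) bytes of dynamic shared memory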
Is there any application-level API available to free shared memory allocated by a CTA in CUDA? I want to reuse my CTA for another task, and before starting that task I should clean up the memory used by the previous task.
Shared memory is statically allocated at kernel launch time. You can optionally specify an unsized shared allocation in the kernel:
__global__ void MyKernel()
{
    __shared__ int fixedShared;
    extern __shared__ int extraShared[];
    ...
}
The third kernel launch parameter then specifies how much shared memory corresponds to that unsized allocation.
MyKernel<<<blocks, threads, numInts*sizeof(int)>>>( ... );
The total amount of shared memory allocated for the kernel launch is the sum of the amount declared in the kernel, plus the shared memory kernel parameter, plus alignment overhead. You cannot "free" it - it stays allocated for the duration of the kernel launch.
For kernels that go through multiple phases of execution and need to use the shared memory for different purposes, what you can do is reuse the memory with shared memory pointers - use pointer arithmetic on the unsized declaration.
Something like:
__global__ void MyKernel()
{
    __shared__ int fixedShared;
    extern __shared__ int extraShared[];
    ...
    __syncthreads();
    char *nowINeedChars = (char *) extraShared;
    ...
}
I don't know of any SDK samples that use this idiom, though the threadFenceReduction sample declares a __shared__ bool and also uses shared memory to hold the partial sums of the reduction.
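One practical consequence is that, since the single allocation lives for the whole kernel launch, you size it at launch for the most demanding phase. A hedged sketch (numInts comes from the launch above; numChars is hypothetical):
// Hypothetical sizing sketch: reserve enough dynamic shared memory for
// the largest of the phases that will reuse it.
size_t phase1Bytes = numInts  * sizeof(int);   // phase 1 uses the buffer as ints
size_t phase2Bytes = numChars * sizeof(char);  // phase 2 reuses the same bytes as chars
size_t shmemBytes  = phase1Bytes > phase2Bytes ? phase1Bytes : phase2Bytes;
MyKernel<<<blocks, threads, shmemBytes>>>( /* ... */ );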
How do I allocate shared variables in CUDA? I have a kernel where data needs to be shared across threads belonging to a particular block. I need two shared variables named sid and eid. I use them like this:
extern __shared__ int sid, eid
but it is giving me an error that __shared__ variables cannot have external linkage.
There are two ways to allocate shared memory: static and dynamic.
1. Static:
__shared__ int Var1[10];
2. Dynamic (you should add the "extern" keyword):
extern __shared__ int Var1[];
If you use the dynamic way to allocate shared memory, you should set the shared memory size when you call the kernel, for example:
testKernel <<< grid, threads, size>>>(...)
The third parameter is the size of the shared memory. With dynamic allocation, all shared memory arrays start from the same address, so if you want to define several shared memory variables, you should write code like the following.
__global__ void func(...)
{
    extern __shared__ char array[];
    short* array0 = (short*)array;
    float* array1 = (float*)(&array0[128]);
}
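For the original question about sid and eid specifically: since their number and type are fixed, a minimal sketch would simply declare them statically, without extern (the initialization values here are placeholders):
__global__ void myKernel(/* ... */)
{
    __shared__ int sid;   // one copy per block, shared by all its threads
    __shared__ int eid;
    if (threadIdx.x == 0) {
        sid = 0;          // placeholder: compute the block's start index here
        eid = 0;          // placeholder: compute the block's end index here
    }
    __syncthreads();      // make sid and eid visible to the whole block
    // ...
}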