In a kernel function, I want two vectors in shared memory, both of length size (that is, sizeof(float)*size bytes each).
Since it is not possible to allocate shared memory directly inside the kernel function when the size is a variable, I had to allocate it dynamically, like:
myKernel<<<numBlocks, numThreads, 2*sizeof(float)*size>>> (...);
and, inside the kernel:
extern __shared__ float row[];
extern __shared__ float results[];
But this doesn't work.
Instead, I made a single vector extern __shared__ float rowresults[] containing all the data, using the 2*size memory allocated. So the row accesses are still the same, and the results accesses are like rowresults[size+previousIndex]. And this does work.
It is not a big problem because I get my expected results anyway, but is there any way to split my dynamically allocated shared memory into two (or more) different variables? Just for beauty.
The CUDA C Programming Guide's section on __shared__ includes an example of allocating multiple arrays from dynamically allocated shared memory:
extern __shared__ float array[];
__device__ void func() // __device__ or __global__ function
{
short* array0 = (short*)array;
float* array1 = (float*)&array0[128];
int* array2 = (int*)&array1[64];
}
Since you're just taking a pointer to an element and treating that as a new array, I believe you could adapt this to use dynamic offsets instead of the static offsets in the example. They also note that the alignment has to be the same, which shouldn't be an issue in your case since both of your arrays hold floats.
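For instance, here is a minimal sketch of that adaptation for the two float arrays in the question (the kernel signature is illustrative; size is assumed to be passed in as an argument):
__global__ void myKernel(float *input, int size)  // illustrative signature
{
    // One dynamic allocation of 2*size floats, launched with
    // myKernel<<<numBlocks, numThreads, 2 * size * sizeof(float)>>>(input, size);
    extern __shared__ float rowresults[];
    float *row     = rowresults;         // first `size` floats
    float *results = rowresults + size;  // next `size` floats
    // ... use row[i] and results[i] as before ...
}
Since both pieces are float, the alignment requirement mentioned in the Guide is automatically satisfied.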
Out of the following two ways of allocating shared memory statically, which method is correct, and why? I get the same results for both, but I am trying to understand the behavior in a little more detail.
Kernel 1:
__shared__ int as[3][3], bs[3][3];

__global__ void Sharesum(int *a, int *b, int *c, int n)
{
    int s, k, i, sum = 0;
    int tx, ty, bx, by;
    tx = threadIdx.x;
    ty = threadIdx.y;
    as[ty][tx] = a[tx + n * ty];
    bs[ty][tx] = b[tx + n * ty];
    sum += as[ty][tx] + bs[ty][tx];
    c[tx * n + ty] = sum;
}
Kernel 2:
__global__ void Sharesum(int *a, int *b, int *c, int n)
{
    __shared__ int as[3][3], bs[3][3];
    int s, k, i, sum = 0;
    int tx, ty, bx, by;
    tx = threadIdx.x;
    ty = threadIdx.y;
    as[ty][tx] = a[tx + n * ty];
    bs[ty][tx] = b[tx + n * ty];
    sum += as[ty][tx] + bs[ty][tx];
    c[tx * n + ty] = sum;
}
There shouldn't be any difference between these two methods for what you have shown. I'm not sure there is an answer that suggests that one is "correct" and one isn't.
However, the first one, which we could call a "global scope" declaration, affects all kernels defined in the module. That means all kernels will reserve and have available shared allocations according to the global definition.
The second one only affects the kernel it is scoped to.
Either one or both could be correct, depending on your desired intent.
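To illustrate the scope difference, a sketch (the second kernel is made up for illustration):
// Global ("module") scope: every kernel in this translation unit reserves
// these arrays as part of its static shared memory footprint.
__shared__ int as[3][3], bs[3][3];

__global__ void Sharesum(int *a, int *b, int *c, int n)
{
    // uses as and bs, as in the question
}

__global__ void AnotherKernel(int *d)
{
    // also sees (and reserves space for) as and bs;
    // each block of each kernel still gets its own copy
}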
I am running a fitness function for 1024 matrices; each matrix gets its own block and is the same size. Each block has n*n threads (where n is the dimension of the matrix) and needs n*n shared memory so that I can do an easy sum reduction. However, the dimension n for all the matrices is variable before runtime (i.e. it can be manually changed, though it is always a power of 2 so the summation is simple). The problem here is that shared memory must be allocated using a constant, but I also need the value on the host to pass to the kernel. Where do I declare the dimension n so that it is visible to the CPU (for passing to the kernel) and can be used to declare the size of the shared memory (within the kernel)?
My code is structured like this:
from main.cu I call the kernel:
const int num_states = 1024;
const int dimension = 4;
fitness <<< num_states, dimension * dimension >>> (device_array_of_states, dimension, num_states, device_fitness_return);
and then in kernel.cu I have:
__global__ void fitness(
int *numbers,
int dimension,
int num_states,
int *fitness_return) {
    __shared__ int fitness[16]; // <-- needs to be dimension * dimension
//code
}
numbers is an array representing 1024 matrices, dimension is the row and column length, num_states is 1024, and fitness_return is an array of length 1024 that holds the fitness value for each matrix. In the kernel, the shared memory size is hard-coded as the square of dimension (which is 4 in this example).
Where and how can I declare dimension so that it can be used to allocate shared memory as well as call the kernel, this way I only have to update dimension in one place? Thanks for your help.
The amount of allocated shared memory is uniform over all blocks. You might be using a different amount of shared memory in each block, but the full amount is still reserved for each. Also, the amount of shared memory is rather limited regardless, so n*n elements cannot exceed the maximum amount of space (typically 48 KiB); for 4-byte elements such as int or float, that means n cannot exceed about 110.
Now, there are two ways to allocate shared memory: Static and Dynamic.
Static allocation is what you gave as an example, which would not work:
__shared__ int fitness[16];
in these cases, the size must be known at compile time (when the device-side code is compiled) - which is not the case for you.
With dynamic shared memory allocation, you don't specify the size in the kernel code - you leave the brackets empty and prepend extern:
extern __shared__ int fitness[];
Instead, you specify the amount when launching the kernel, and the threads of the different blocks don't necessarily know what it is.
But in your case, the threads do need to know what n is. Well, just pass it as a kernel argument. So,
__global__ void fitness(
int *numbers,
int dimension,
int num_states,
int *fitness_return,
unsigned short fitness_matrix_order /* that's your n*/)
{
extern __shared__ int fitness[];
/* ... etc ... */
}
NVIDIA's Parallel Forall blog has a nice post with a more in-depth introduction to using shared memory, which specifically covers static and dynamic shared memory allocation.
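Putting it together for this question, a sketch of the matching launch, using the names from the question and the extra argument suggested above:
const int num_states = 1024;
const int dimension = 4;  // n, changed in this one place

// Third launch parameter: dynamic shared memory size in bytes per block.
fitness<<<num_states,
          dimension * dimension,
          dimension * dimension * sizeof(int)>>>(
    device_array_of_states, dimension, num_states, device_fitness_return,
    (unsigned short)dimension);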
I have several questions regarding CUDA shared memory.
First, as mentioned in this post, shared memory may be declared in two different ways.
Either dynamically allocated shared memory, like the following:
// Launch the kernel
dynamicReverse<<<1, n, n*sizeof(int)>>>(d_d, n);
This may be used inside a kernel as mentioned:
extern __shared__ int s[];
Or static shared memory, which can be used inside a kernel like the following:
__shared__ int s[64];
Both are used for different reasons; however, which one is better, and why?
Second, I'm running a multi-block kernel with 256 threads per block. I'm using static shared memory in the global kernel and in a device function it calls; both of them use shared memory. An example is given:
__device__ float reduce(float data, unsigned int tid);  // forward declaration

__global__ void startKernel(float *p_d_array)
{
    __shared__ double matA[3 * 3];
    float a1 = 0;
    float a2 = 0;
    float a3 = 0;
    float b = p_d_array[threadIdx.x];
    a1 += reduce(b, threadIdx.x);
    a2 += reduce(b, threadIdx.x);
    a3 += reduce(b, threadIdx.x);
    // continue...
}

__device__ float reduce(float data, unsigned int tid)
{
    __shared__ float sdata[256];  // renamed so it does not shadow the parameter
    // do reduce ...
}
I'd like to know how the shared memory is allocated in such a case. I presume each block receives its own copy of the shared memory.
What happens when block #0 goes into the reduce function?
Is the shared memory allocated in advance of the function call?
I call the reduce device function three times. In such a case, theoretically, in block #0 threads #[0,127] may still be executing the first reduce call (delayed due to heavy work) while threads #[128,255] are already operating on the second reduce call. In this case, I'd like to know whether both reduce calls are using the same shared memory,
even though they are made from two different call sites?
On the other hand, is it possible that a single block allocates 3*256*sizeof(float) of shared memory for the separate function calls? That seems superfluous in CUDA terms, but I still want to know how CUDA operates in such a case.
Third, is it possible to gain higher performance from shared memory through compiler optimization, e.g. by using
const float *p_shared;
or the restrict keyword after the data assignment section?
AFAIR, there is little difference whether you request shared memory "dynamically" or "statically" - in either case it's just a kernel launch parameter, be it set by your code or by code generated by the compiler.
Re: the second question, the compiler will sum the shared memory requirements of the kernel function and of the functions called by the kernel.
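If you want to verify that sum on the host, one option is cudaFuncGetAttributes; a sketch, assuming the (fixed-up) startKernel from the question is visible in this translation unit:
#include <cstdio>
#include <cuda_runtime.h>

void printStaticSharedMem()
{
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, startKernel);
    // sharedSizeBytes is the static shared memory per block, which includes
    // the __shared__ arrays of the __device__ functions the kernel calls;
    // dynamic shared memory requested at launch time is not included.
    printf("static shared memory per block: %zu bytes\n", attr.sharedSizeBytes);
}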
Is there any way to deallocate shared memory previously allocated inside the same CUDA kernel?
For example, inside the kernel at one point I have defined
__shared__ unsigned char flag;
__shared__ unsigned int values [ BLOCK_DIM ];
Later on in the code, I need to define an array that, together with the previously defined shared memory, exceeds the shared memory limit set for a block. How can I do that without the dirty work of re-using the previously defined shared memory? Or is NVCC smart enough to recognize dependencies along the kernel trace and deallocate the shared variables once they are no longer used?
My device is GeForce GTX 780 (CC=3.5).
In C/C++, it is not possible to deallocate statically defined arrays.
You may wish to dynamically allocate the amount of shared memory needed for the worst case as follows. Add
extern __shared__ float foo[];
within the kernel function and launch your kernel function as
myKernel<<<numBlocks, numThreads, sh_mem_size>>> (...);
Remember that you can manage multiple arrays by playing with pointers. Have a look at the CUDA C Programming Guide for further details. For example, quoting the Guide
extern __shared__ float array[];
__device__ void func() // __device__ or __global__ function
{
short* array0 = (short*)array;
float* array1 = (float*)&array0[128];
int* array2 = (int*)&array1[64];
}
By the same concept, you can dynamically change the size of the arrays you are dealing with.
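In other words, a single worst-case pool can be reinterpreted in different phases of the kernel; a sketch, reusing BLOCK_DIM from the question plus an illustrative second-phase array:
__global__ void myKernel(/* ... */)
{
    extern __shared__ unsigned char pool[];  // sized at launch for the worst case

    // Phase 1: the flag and values array from the question
    // (values placed first to keep 4-byte alignment).
    unsigned int  *values = (unsigned int *)pool;            // BLOCK_DIM elements
    unsigned char *flag   = pool + BLOCK_DIM * sizeof(unsigned int);

    // ... phase 1 work with *flag and values[] ...
    __syncthreads();  // make sure every thread is done before the bytes are reused

    // Phase 2: reinterpret the same bytes as the large array that would not
    // fit alongside the phase-1 data.
    float *bigArray = (float *)pool;
    // ... phase 2 work with bigArray[] ...
}
Launch with sh_mem_size set to the larger of the phase-1 and phase-2 byte counts.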
How do I allocate shared variables in CUDA? I have a kernel where data needs to be shared across the threads belonging to a particular block. I need two shared variables named sid and eid. I use them like this:
extern __shared__ int sid, eid
but it is giving me an error that __shared__ variables cannot have external linkage.
There are two ways to allocate shared memory: static and dynamic.
1. Static:
__shared__ int Var1[10];
2. Dynamic: add the "extern" keyword:
extern __shared__ int Var1[];
If you use the dynamic way to allocate shared memory, you should set the shared memory size when you call the kernel, for example:
testKernel <<< grid, threads, size>>>(...)
The third parameter is the size of the shared memory. Allocated this way, all of the shared memory starts from the same address, so if you want to define several shared memory variables, you should write code like the following:
__global__ void func(...)
{
    extern __shared__ char array[];
    short *array0 = (short *)array;
    float *array1 = (float *)&array0[128];
}