This works fine:
a_size = FindSizeAtRunTime();
Kernel<<< gridDim, blockDim, a_size >>>(count);
But this produces an error:
__global__ void Kernel(int count_a, int count_b)
{
int a_size = FindSizeAtRunTime();
__shared__ int a[a_size];
}
error: expression must have a constant value
In both cases the size is determined at runtime. So why is the first case OK, but not the second?
The second is illegal on two levels.
Firstly, C++98 (from which CUDA is mostly derived) doesn't allow statically declared, dynamically sized arrays. The language doesn't permit it, and so neither does CUDA.
Secondly, and more importantly, the size of dynamic shared memory allocations must be known before a kernel can be launched. The GPU must know how much shared memory to reserve before a block can be scheduled. That isn't possible in your second example, whereas it is in your first example.
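For illustration, a minimal sketch of the working pattern, where the host computes the size and passes it as the third launch parameter (the helper FindSizeAtRunTime, the element type, and the launch configuration are assumptions):

```cuda
// Hypothetical helper: the size is only known at run time, on the host.
int FindSizeAtRunTime() { return 128; }

__global__ void Kernel(int count)
{
    // The unsized extern declaration maps onto the dynamically
    // allocated shared memory reserved at launch.
    extern __shared__ int a[];
    if (threadIdx.x < count)
        a[threadIdx.x] = threadIdx.x;
}

int main()
{
    int count = 128;
    size_t a_size = FindSizeAtRunTime() * sizeof(int);
    // Third launch argument: bytes of dynamic shared memory per block.
    Kernel<<<1, 128, a_size>>>(count);
    cudaDeviceSynchronize();
    return 0;
}
```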
Related
On the host side, I can save the amount of dynamic shared memory I intend to launch a kernel with, and use it. I can even pass that as an argument to the kernel. But - is there a way to get it directly from device code, without help from the host side? That is, have the code for a kernel determine, as it runs, how much dynamic shared memory it has available?
Yes, there's a special register holding that value, named %dynamic_smem_size. You can obtain its value in your CUDA C/C++ code by wrapping some inline PTX in a getter function:
__device__ unsigned dynamic_smem_size()
{
    unsigned ret;
    // The doubled %% is required to emit a literal % in an inline-PTX template.
    asm volatile ("mov.u32 %0, %%dynamic_smem_size;" : "=r"(ret));
    return ret;
}
You can similarly obtain the total size of allocated shared memory (static + dynamic) from the register %total_smem_size.
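A matching getter for that register would follow the same pattern (sketch; note the same inline-PTX escaping applies):

```cuda
__device__ unsigned total_smem_size()
{
    unsigned ret;
    // %total_smem_size reports static + dynamic shared memory for the block.
    asm volatile ("mov.u32 %0, %%total_smem_size;" : "=r"(ret));
    return ret;
}
```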
I'm confused about copying arrays to constant memory.
According to the programming guide, there's at least one way to allocate constant memory and use it to store an array of values. This is called static memory allocation:
__constant__ float constData[256];
float data[256];
cudaMemcpyToSymbol(constData, data, sizeof(data));
cudaMemcpyFromSymbol(data, constData, sizeof(data));
According to the programming guide again, we can use:
__device__ float* devPointer;
float* ptr;
cudaMalloc(&ptr, 256 * sizeof(float));
cudaMemcpyToSymbol(devPointer, &ptr, sizeof(ptr));
It looks like dynamic constant memory allocation is being used here, but I'm not sure about it. Also, no __constant__ qualifier appears.
So here are some questions:
Is this pointer stored in constant memory?
Is assigned (by this pointer) memory stored in constant memory too?
Is this pointer constant, i.e. not allowed to change from device or host code? And is changing the values of the array prohibited or not? If changing the array's values is allowed, does that mean constant memory is not used to store them?
The developer can declare up to 64K of constant memory at file scope. In SM 1.0, the constant memory used by the toolchain (e.g. to hold compile-time constants) was separate and distinct from the constant memory available to developers, and I don't think this has changed since. The driver dynamically manages switching between different views of constant memory as it launches kernels that reside in different compilation units. Although you cannot allocate constant memory dynamically, this pattern suffices because the 64K limit is not system-wide; it applies per compilation unit.
Use the first pattern cited in your question: statically declare the constant data and update it with cudaMemcpyToSymbol before launching kernels that reference it. In the second pattern, only reads of the pointer itself go through constant memory; reads using the pointer are serviced by the normal L1/L2 cache hierarchy.
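A minimal sketch of that first pattern put together (the kernel body and array size are assumptions):

```cuda
__constant__ float constData[256];   // statically declared constant memory

__global__ void scale(float *out)
{
    int i = threadIdx.x;
    out[i] = 2.0f * constData[i];    // this read is served by the constant cache
}

int main()
{
    float data[256];
    for (int i = 0; i < 256; ++i)
        data[i] = (float)i;

    // Update the constant data before launching kernels that reference it.
    cudaMemcpyToSymbol(constData, data, sizeof(data));

    float *d_out;
    cudaMalloc(&d_out, sizeof(data));
    scale<<<1, 256>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```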
I am new to CUDA, and I am confused by kernel calls.
When you call a kernel method you specify the number of blocks and the threads per block, like this: kernelMethod<<< blocks, threads >>>(parameters);
So why is it possible to use a 3rd parameter?
kernelMethod<<< block, Threads, ???>>>(parameters);
Using cudaDeviceProp you can read the maximum number of threads per block from the maxThreadsPerBlock field. But how can I know the maximum number of blocks?
Thanks!!
The third parameter specifies the amount of shared memory per block to be dynamically allocated. The programming guide provides additional detail about shared memory, as well as a description and example.
Shared memory can be allocated statically in a kernel:
__shared__ int myints[256];
or dynamically:
extern __shared__ int myints[];
In the latter case, it's necessary to pass as an additional kernel config parameter (the 3rd parameter you mention) the size of the shared memory to be allocated in bytes.
In that event, the pointer myints then points to the beginning of that dynamically allocated region.
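Putting the two declarations side by side for illustration (kernel name and sizes are assumptions):

```cuda
__global__ void demo(int n)
{
    __shared__ int fixed[256];        // static: size fixed at compile time
    extern __shared__ int dynamic[];  // dynamic: size set by the 3rd launch parameter

    int i = threadIdx.x;
    if (i < 256) fixed[i] = i;
    if (i < n)   dynamic[i] = i;
}

// Launch with n ints of dynamic shared memory, e.g.:
// demo<<<grid, block, n * sizeof(int)>>>(n);
```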
The maximum number of blocks is specified per grid dimension (x, y, z) and can also be obtained through the device properties query. It is specified in the maxGridSize parameter. You may want to refer to the deviceQuery sample for a worked example.
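A short sketch of that query, printing both limits (field names as in the CUDA runtime API):

```cuda
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    printf("maxGridSize: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}
```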
Reduction in CUDA has utterly baffled me! First off, both this tutorial by Mark Harris and this one by Mike Giles make use of the declaration extern __shared__ temp[]. The keyword extern is used in C when a declaration is made but the allocation takes place elsewhere (e.g. in another C file, generally in another context). What is the relevance of extern here? Why don't we use:
__shared__ float temp[N/2];
for instance? Or why don't we declare temp to be a global variable, e.g.
#define N 1024
__shared__ float temp[N/2];
__global__ void sum(float *sum, float *data){ ... }
int main(){
...
sum<<<M,L>>>(sum, data);
}
I have yet another question. How many blocks and threads per block should one use to invoke the summation kernel? I tried this example (based on this).
Note: You can find information about my devices here.
The answer to the first question is that CUDA supports dynamic shared memory allocation at runtime (see this SO question and the documentation for more details). Declaring shared memory with extern tells the compiler that the shared memory size will be determined at kernel launch, passed in bytes as an argument to the <<< >>> syntax (or equivalently via an API function), something like:
sum<<< gridsize, blocksize, sharedmem_size >>>(....);
The answer to the second question is that you normally launch the number of blocks that will completely fill all the streaming multiprocessors on your GPU. Most sensibly written reduction kernels accumulate many values per thread and then perform a shared memory reduction. The reduction requires that the number of threads per block be a power of two, which usually gives you 32, 64, 128, 256, or 512 (or 1024 if you have a Fermi or Kepler GPU). It is a very finite search space; just benchmark to see what works best on your hardware. You can find a more general discussion about block and grid sizing here and here.
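As an illustration, a minimal kernel along those lines: each thread first accumulates several values with a grid-stride loop, then the block performs a power-of-two shared memory reduction (all names and the launch configuration are assumptions):

```cuda
__global__ void sum(float *out, const float *data, int n)
{
    extern __shared__ float temp[];   // sized at launch: blockDim.x floats
    int tid = threadIdx.x;

    // Each thread accumulates many values before the shared memory phase.
    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x)
        acc += data[i];
    temp[tid] = acc;
    __syncthreads();

    // Tree reduction; requires blockDim.x to be a power of two.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            temp[tid] += temp[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = temp[0];    // one partial sum per block
}

// Launch, e.g.: sum<<<numBlocks, 256, 256 * sizeof(float)>>>(d_out, d_data, n);
```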
Is there any way to declare an array such as:
int arraySize = 10;
int array[arraySize];
inside a CUDA kernel/function? I read in another post that I could declare the size of the shared memory in the kernel call and then I would be able to do:
int array[];
But I cannot do this. I get a compile error: "incomplete type is not allowed". On a side note, I've also read that printf() can be called from within a thread and this also throws an error: "Cannot call host function from inside device/global function".
Is there anything I can do to make a variable sized array or equivalent inside CUDA? I am at compute capability 1.1, does this have anything to do with it? Can I get around the variable size array declarations from within a thread by defining a typedef struct which has a size variable I can set? Solutions for compute capabilities besides 1.1 are welcome. This is for a class team project and if there is at least some way to do it I can at least present that information.
About printf: the problem is that it only works on compute capability 2.x and later. There is an alternative, cuPrintf, that you might try.
For the allocation of variable size arrays in CUDA you do it like this:
Inside the kernel you write extern __shared__ int array[]; (the extern declaration must name the array and leave it unsized)
On the kernel call you pass as the third launch parameter the shared memory size in bytes like mykernel<<<gridsize, blocksize, sharedmemsize>>>();
This is explained in the CUDA C programming guide in section B.2.3 about the __shared__ qualifier.
If your arrays can be large, one solution would be to have one kernel that computes the required array sizes and stores them in an array; after that invocation, the host allocates the necessary arrays and passes an array of pointers to the threads, and then you run your computation as a second kernel.
Whether this helps depends on what you have to do, because those would be arrays allocated in global memory. If the total size (per block) of your arrays is less than the size of the available shared memory, you could declare a sufficiently large shared memory array and let your threads negotiate splitting it amongst themselves.
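A sketch of that last idea, where each thread carves its own slice out of one dynamically sized shared array (the per-thread element count and kernel name are assumptions):

```cuda
__global__ void perThreadScratch(int perThreadCount)
{
    extern __shared__ int pool[];                     // one block-wide pool
    int *mine = pool + threadIdx.x * perThreadCount;  // this thread's slice

    for (int i = 0; i < perThreadCount; ++i)
        mine[i] = threadIdx.x;                        // use the slice as scratch
}

// Launch with blockDim.x * perThreadCount ints of dynamic shared memory:
// perThreadScratch<<<grid, block, block * perThreadCount * sizeof(int)>>>(perThreadCount);
```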