Kernel calls in CUDA

I am new to CUDA, and I am confused by kernel calls.
When you call a kernel you specify the number of blocks and the number of threads per block, like this: kernelMethod<<< block, Threads >>>(parameters);
So why is it possible to use a 3rd parameter?
kernelMethod<<< block, Threads, ???>>>(parameters);
Using cudaDeviceProp you can read the maximum number of threads per block from the maxThreadsPerBlock field. But how can I know the maximum number of blocks?
Thanks!!

The third parameter specifies the amount of shared memory per block to be dynamically allocated. The programming guide provides additional detail about shared memory, as well as a description and example.
Shared memory can be allocated statically in a kernel:
__shared__ int myints[256];
or dynamically:
extern __shared__ int myints[];
In the latter case, it's necessary to pass as an additional kernel config parameter (the 3rd parameter you mention) the size of the shared memory to be allocated in bytes.
In that event, the pointer myints then points to the beginning of that dynamically allocated region.
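As a minimal sketch (the kernel, the reduction it performs, and the host variables blocks/threads/d_in/d_out are all made up here and assumed to be set up elsewhere), the two pieces fit together like this: the kernel declares the array with extern __shared__, and the host passes its size in bytes as the 3rd configuration parameter:
__global__ void blockSum(const int *in, int *out)
{
    extern __shared__ int myints[];    // size set by the 3rd launch parameter
    myints[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();
    // simple tree reduction within the block (assumes blockDim.x is a power of two)
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            myints[threadIdx.x] += myints[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = myints[0];
}
...
// host side: reserve threads * sizeof(int) bytes of dynamic shared memory per block
blockSum<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_out);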
The maximum number of blocks is specified per grid dimension (x, y, z) and can also be obtained through the device properties query. It is specified in the maxGridSize parameter. You may want to refer to the deviceQuery sample for a worked example.
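For example, a small standalone program (modeled loosely on deviceQuery) that prints the relevant fields could look like this:
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max grid size: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}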

Related

Dynamically allocated shared memory in CUDA. Execution Configuration

What does Nvidia mean by this?
Ns is of type size_t and specifies the number of bytes in shared
memory that is dynamically allocated per block for this call in
addition to the statically allocated memory; this dynamically
allocated memory is used by any of the variables declared as an
external array as mentioned in __shared__; Ns is an optional
argument which defaults to 0;
Size of shared memory in my GPU is 48kB.
For example, I want to run 4 kernels at the same time, each of them using 12 kB of shared memory.
In order to do that, should I call the kernel this way
kernel<<< gridSize, blockSize, 12 * 1024 >>>();
or should the third argument be 48 * 1024 ?
Ns is a size in bytes. If you want to reserve 12 kB of shared memory you would pass 12*1024.
I doubt you want to do this, though. The Ns value is PER BLOCK, i.e. it is the amount of dynamically allocated shared memory for each block executing on the device. I'm guessing you'd like something along the lines of 12*1024/number_of_blocks.
Kernel launching with concurrency:
If, as mentioned in a comment, you are using streams, there is a fourth argument in the kernel launch, which is the CUDA stream.
If you want to launch a kernel on another stream without any shared memory it will look like:
kernel_name<<<128, 128, 0, mystream>>>(...);
but concurrency is a whole different issue.
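As a rough sketch (the stream names are made up, and the kernel and its arguments are assumed to exist elsewhere), launching the same kernel on two streams with no dynamic shared memory looks like this:
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

// same grid/block configuration, different streams, 3rd parameter left at 0
kernel_name<<<128, 128, 0, s1>>>(...);
kernel_name<<<128, 128, 0, s2>>>(...);

cudaStreamSynchronize(s1);
cudaStreamSynchronize(s2);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);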

Understanding this CUDA kernel's launch parameters

I am trying to analyze some code I have found online and I keep thinking myself into a corner. I am looking at a histogram kernel launched with the following parameters
histogram<<<2500, numBins, numBins * sizeof(unsigned int)>>>(...);
I know that the parameters are grid, block, shared memory sizes.
So does that mean that there are 2500 blocks of numBins threads each, each block also having a numBins * sizeof(unsigned int) chunk of shared memory available to its threads?
Also, within the kernel itself there are calls to __syncthreads(), are there then 2500 sets of numBins calls to __syncthreads() over the course of the kernel call?
So does that mean that there are 2500 blocks of numBins threads each,
each block also having a numBins * sizeof(unsigned int) chunk of
shared memory available to its threads?
From the CUDA Toolkit documentation:
The execution configuration (of a global function call) is specified by inserting an expression of the form <<<Dg,Db,Ns,S>>>, where:
Dg (dim3) specifies the dimension and size of the grid.
Db (dim3) specifies the dimension and size of each block
Ns (size_t) specifies the number of bytes in shared memory that is dynamically allocated per block for this call in addition to the statically allocated memory.
S (cudaStream_t) specifies the associated stream, is an optional parameter which defaults to 0.
So, as #Fazar pointed out, the answer is yes. This memory is allocated per block.
Also, within the kernel itself there are calls to __syncthreads(), are
there then 2500 sets of numBins calls to __syncthreads() over the
course of the kernel call?
__syncthreads() waits until all threads in the thread block have reached this point. It is used to coordinate communication between threads in the same block.
So __syncthreads() is a per-block barrier: each of the 2500 blocks executes its own barriers independently, and blocks are never synchronized with each other by it.
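For reference, a hypothetical shape for such a histogram kernel (not the code you found, just a sketch, with the binning expression as a placeholder) makes the per-block nature of the barriers visible: each block zeroes its own shared-memory copy of the bins, synchronizes, accumulates, synchronizes again, and flushes to global memory, without ever waiting on another block.
// launched as: histogram<<<2500, numBins, numBins * sizeof(unsigned int)>>>(d_values, d_bins, numValues);
// assumes blockDim.x == numBins, i.e. one thread per bin
__global__ void histogram(const unsigned int *values, unsigned int *bins, int numValues)
{
    extern __shared__ unsigned int localBins[];    // numBins entries per block

    localBins[threadIdx.x] = 0;
    __syncthreads();                               // barrier #1: this block's bins are zeroed

    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < numValues;
         i += gridDim.x * blockDim.x)
        atomicAdd(&localBins[values[i] % blockDim.x], 1u);   // placeholder binning
    __syncthreads();                               // barrier #2: this block's counts are done

    atomicAdd(&bins[threadIdx.x], localBins[threadIdx.x]);
}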

Passing value from device memory as kernel parameter in CUDA

I'm writing a CUDA application that has a step where the variance of some complex-valued input data is computed, and then that variance is used to threshold the data. I've got a reduction kernel that computes the variance for me, but I'm not sure if I have to pull the value back to the host to pass it to the thresholding kernel or not.
Is there a way to pass the value directly from device memory?
You can use a __device__ variable to hold the variance value in-between kernel calls.
Put this before the definition of the kernels that use it:
__device__ float my_variance = 0.0f;
Variables defined this way can be used by any kernel executing on the device (without requiring that they be explicitly passed as a kernel function parameter) and persist for the lifetime of the context, i.e. beyond the lifetime of any single kernel call.
It's not entirely clear from your question, but you can also define an array of data this way.
__device__ float my_variance[32] = {0.0f};
Likewise, allocations created by cudaMalloc live for the duration of the application/context (or until an appropriate cudaFree is encountered) and so there is no need to "pull back the data" to the host if you want to use it in a successive kernel:
float *d_variance;
cudaMalloc((void **)&d_variance, sizeof(float));
my_reduction_kernel<<<...>>>(..., d_variance, ...);
my_thresholding_kernel<<<...>>>(..., d_variance, ...);
Any value set in *d_variance by the reduction kernel above will be properly observed by the thresholding kernel.
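A minimal sketch of the __device__-variable version (the kernel bodies are placeholders, not your actual reduction), showing that nothing needs to be copied back to the host between the two kernels:
__device__ float my_variance = 0.0f;

__global__ void my_reduction_kernel(const float *data, int n)
{
    // ... reduce data here ...
    if (blockIdx.x == 0 && threadIdx.x == 0)
        my_variance = 1.0f;                        // placeholder for the real reduction result
}

__global__ void my_thresholding_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && fabsf(data[i]) < my_variance)     // read directly from device memory
        data[i] = 0.0f;
}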

How to share a common value between threads in a given block?

I have a kernel that, for each thread in a given block, computes a for loop with a different number of iterations. I use a buffer of size N_BLOCKS to store the number of iterations required for each block. Hence, each thread in a given block must know the number of iterations specific to its block.
However, I'm not sure which way is the best (performance speaking) to read the value and distribute it to all the other threads. I see only one good way (please tell me if there is something better): store the value in shared memory and have each thread read it. For example:
__global__ void foo( int* nIterBuf )
{
    __shared__ int nIter;

    if( threadIdx.x == 0 )
        nIter = nIterBuf[blockIdx.x];
    __syncthreads();

    for( int i = 0; i < nIter; i++ )
        ...
}
Any other better solutions? My app will use a lot of data, so I want the best performance.
Thanks!
Read-only values that are uniform across all threads in a block are probably best stored in __constant__ arrays. On some CUDA architectures such as Fermi (SM 2.x), if you declare the array or pointer argument using the C++ const keyword AND you access it uniformly within the block (i.e. the index only depends on blockIdx, not threadIdx), then the compiler may automatically promote the reference to constant memory.
The advantage of constant memory is that it goes through a dedicated cache, so it doesn't pollute the L1; if the amount of data you are accessing per block is relatively small, you should always hit in that cache after the initial compulsory miss in each thread block.
You also won't need to use any shared memory or transfer from global to shared memory.
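Applied to your example, a sketch of the __constant__ variant could look like this (the 1024-block upper bound is just an assumption; constant memory is limited to 64 KB total):
__constant__ int nIterBuf_c[1024];       // assumed upper bound on N_BLOCKS

__global__ void foo()
{
    int nIter = nIterBuf_c[blockIdx.x];  // uniform within the block, served by the constant cache
    for( int i = 0; i < nIter; i++ )
    {
        // ...
    }
}

// host side, before the launch:
// cudaMemcpyToSymbol(nIterBuf_c, h_nIterBuf, N_BLOCKS * sizeof(int));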
If my info is up to date, shared memory is the second-fastest memory, behind only the registers.
If reading this data from shared memory every iteration slows you down and you still have registers available (refer to your GPU's compute capability and specs), you could perhaps try to store a copy of this value in every thread's register (using a local variable).

Variable Sizes Array in CUDA

Is there any way to declare an array such as:
int arraySize = 10;
int array[arraySize];
inside a CUDA kernel/function? I read in another post that I could declare the size of the shared memory in the kernel call and then I would be able to do:
int array[];
But I cannot do this. I get a compile error: "incomplete type is not allowed". On a side note, I've also read that printf() can be called from within a thread and this also throws an error: "Cannot call host function from inside device/global function".
Is there anything I can do to make a variable sized array or equivalent inside CUDA? I am at compute capability 1.1, does this have anything to do with it? Can I get around the variable size array declarations from within a thread by defining a typedef struct which has a size variable I can set? Solutions for compute capabilities besides 1.1 are welcome. This is for a class team project and if there is at least some way to do it I can at least present that information.
About printf: the problem is that it only works on compute capability 2.x and higher. There is an alternative, cuPrintf, that you might try.
For the allocation of variable size arrays in CUDA you do it like this:
Inside the kernel you write extern __shared__ int array[];
On the kernel call you pass as the third launch parameter the shared memory size in bytes like mykernel<<<gridsize, blocksize, sharedmemsize>>>();
This is explained in the CUDA C programming guide in section B.2.3 about the __shared__ qualifier.
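Putting those two steps together, a minimal sketch (gridsize and blocksize are assumed to be defined; the kernel body is a placeholder) where the array size is only known at run time:
__global__ void mykernel(int arraySize)
{
    extern __shared__ int array[];       // size comes from the launch configuration
    if (threadIdx.x < arraySize)
        array[threadIdx.x] = threadIdx.x;
    __syncthreads();
    // ... work with array[0 .. arraySize-1] ...
}

// host side:
int arraySize = 10;
mykernel<<<gridsize, blocksize, arraySize * sizeof(int)>>>(arraySize);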
If your arrays can be large, one solution would be to have one kernel that computes the required array sizes and stores them in an array; after that invocation, the host allocates the necessary arrays and passes an array of pointers to the threads, and you run your computation as a second kernel.
Whether this helps depends on what you have to do, because the arrays would be allocated in global memory. If the total size (per block) of your arrays is less than the size of the available shared memory, you could have a sufficiently large shared memory array and let your threads negotiate splitting it amongst themselves.
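A rough sketch of that two-pass idea (all names hypothetical, and the size computation and per-thread work are placeholders): the first kernel records how much each thread needs, and after the host has allocated the storage and built a table of per-thread pointers, the second kernel works through that table:
__global__ void computeSizes(int *sizes, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        sizes[i] = (i % 7) + 1;          // placeholder for the real size computation
}

__global__ void compute(int * const *arrays, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int *myArray = arrays[i];        // this thread's array in global memory
        myArray[0] = i;                  // ... real per-thread work here ...
    }
}

// host side (sketch): copy the sizes back, allocate the storage (e.g. one big
// buffer carved up by a prefix sum of the sizes), build the per-thread pointer
// table, copy it to the device, then launch compute.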