declaring shared memory variables in a kernel - cuda

I have a question about how shared variables work.
When I declare a shared variable in a kernel like this
__shared__ int array1[N];
the shared memory of every active block now has its own instance of array1 with size N, meaning that every active block has allocated N*sizeof(int) bytes of its shared memory.
And N*sizeof(int) must be at most 16KB for a GPU with compute capability 1.3.
So, assuming the above is correct and using 2D threads and 2D blocks assigned at host like this:
dim3 block_size(22,22);
dim3 grid_size(25,25);
I would have 25x25 instances of array1, each of size N*sizeof(int), and at most 22x22 threads could access the shared memory of any one block.
This was my original question and it was answered.
Q: When I assign a value to array1
array1[0]=1;
then do all active blocks instantly assign that value in their own shared memory?

Each block will always allocate its own shared memory array. So, if you launch 25x25 blocks, you will ultimately create 25x25 arrays in shared memory.
It does not mean, however, that all those arrays will exist at the same time, because it is not guaranteed that all blocks exist at the same time. The number of active blocks depends on the actual model of the GPU it is being run on. The GPU driver will try to launch as many as possible, and the extra blocks will run after the previous ones end their work.
The maximum of N*sizeof(int) depends on the Compute Capability of your card and the L1-cache configuration. It can vary between 8KB, 16KB, 32KB and 48KB.
To answer your last question - each shared array is visible to all threads belonging to the corresponding block. In your case each shared array will be visible to the corresponding 22x22 threads.
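To illustrate, here is a minimal sketch under the question's launch configuration (the kernel name fillShared, the output pointer d_out and the value of N are illustrative, not from the question): each block writes only to its own copy of array1, and that write never propagates to another block's copy.
#define N 64   // illustrative; N*sizeof(int) must fit the per-block shared memory limit

__global__ void fillShared(int *d_out)
{
    __shared__ int array1[N];     // one private copy per resident block

    if (threadIdx.x == 0 && threadIdx.y == 0)
        array1[0] = 1;            // modifies only this block's copy

    __syncthreads();              // make the write visible to the rest of this block

    // every thread of this block now reads array1[0] == 1; writes never
    // propagate to the copies owned by other blocks
    if (threadIdx.x == 0 && threadIdx.y == 0)
        d_out[blockIdx.y * gridDim.x + blockIdx.x] = array1[0];
}

// launched as in the question:
// dim3 block_size(22, 22);
// dim3 grid_size(25, 25);
// fillShared<<<grid_size, block_size>>>(d_out);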

Related

Limiting register usage in CUDA: __launch_bounds__ vs maxrregcount

From the NVIDIA CUDA C Programming Guide:
Register usage can be controlled using the maxrregcount compiler
option or launch bounds as described in Launch Bounds.
From my understanding (and correct me if I'm wrong), while -maxrregcount limits the number of registers the entire .cu file may use, the __launch_bounds__ qualifier defines the maxThreadsPerBlock and minBlocksPerMultiprocessor for each __global__ kernel. These two accomplish the same task, but in two different ways.
My usage requires me to have 40 registers per thread to maximize the performance. Thus, I can use -maxrregcount 40. I can also force 40 registers by using __launch_bounds__(256, 6) but this causes load & store register spills.
What is the difference between the two to cause these register spills?
The preface of this question is that, quoting the CUDA C Programming Guide,
the fewer registers a kernel uses, the more threads and thread blocks
are likely to reside on a multiprocessor, which can improve
performance.
Now, __launch_bounds__ and maxrregcount limit register usage by two different mechanisms.
__launch_bounds__
nvcc decides the number of registers to be used by a __global__ function through balancing the performance and the generality of the kernel launch setup. Saying it differently, such a choice of the number of used registers "guarantees effectiveness" for different numbers of threads per block and of blocks per multiprocessor. However, if an approximate idea of the maximum number of threads per block and (possibly) of the minimum number of blocks per multiprocessor is available at compile-time, then this information can be used to optimize the kernel for such launches. In other words
#define MAX_THREADS_PER_BLOCK 256
#define MIN_BLOCKS_PER_MP 2
__global__ void
__launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MP)
fooKernel(int *inArr, int *outArr)
{
    // ... computation of the kernel
}
informs the compiler of a likely launch configuration, so that nvcc can select the number of registers for such a launch configuration in an "optimal" way.
The MAX_THREADS_PER_BLOCK parameter is mandatory, while the MIN_BLOCKS_PER_MP parameter is optional. Also note that if the kernel is launched with a number of threads per block larger than MAX_THREADS_PER_BLOCK, the kernel launch will fail.
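As a sketch of how that failure surfaces (reusing fooKernel from the snippet above; d_in and d_out are assumed to be valid device pointers and the usual <cstdio> include is assumed): the error is reported by the next runtime error check rather than by a crash inside the kernel.
// 512 threads per block exceeds MAX_THREADS_PER_BLOCK (256), so the launch fails
fooKernel<<<128, 512>>>(d_in, d_out);

cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));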
The limiting mechanism is described in the Programming Guide as follows:
If launch bounds are specified, the compiler first derives from them
the upper limit L on the number of registers the kernel should use
to ensure that minBlocksPerMultiprocessor blocks (or a single block
if minBlocksPerMultiprocessor is not specified) of
maxThreadsPerBlock threads can reside on the multiprocessor. The
compiler then optimizes register usage in the following way:
If the initial register usage is higher than L, the compiler reduces it further until it becomes less or equal to L, usually at
the expense of more local memory usage and/or higher number of
instructions;
Accordingly, __launch_bounds__ can lead to register spill.
maxrregcount
maxrregcount is a compiler flag that, unlike __launch_bounds__, simply hard-limits the number of registers to a value set by the user by forcing the compiler to rearrange its use of registers. When the compiler cannot stay below the imposed limit, it will simply spill the excess to local memory, which physically resides in DRAM. Even though these local variables are stored in global DRAM memory, they can be cached in L1 and L2.
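For reference, a minimal sketch of how the flag is typically applied on the command line (the file name foo.cu is illustrative); -Xptxas -v makes ptxas print the resulting register and spill counts so the effect of the cap is visible:
nvcc --maxrregcount=40 -Xptxas -v -o foo foo.cu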

How do I appropriately size and launch a CUDA grid?

First question:
Suppose I need to launch a kernel with 229080 threads on a Tesla C1060 which has compute capability 1.3.
So according to the documentation this machine has 240 cores, with 8 cores on each streaming multiprocessor (SM), for a total of 30 SMs.
I can use up to 1024 threads per SM for a total of 30720 threads running "concurrently".
Now if I define blocks of 256 threads that means I can have 4 blocks for each SM because 1024/256=4. So those 30720 threads can be arranged in 120 blocks across all SMs.
Now for my example of 229080 threads I would need 229080/256=~895 (rounded up) blocks to process all the threads.
Now let's say I want to call a kernel and I must use those 229080 threads, so I have two options. The first is to divide the problem so that I call the kernel ~8 times in a for loop with a grid of 120 blocks and 30720 threads each time (229080/30720). That way I make sure the device stays completely occupied. The other option is to call the kernel with a grid of 895 blocks for the entire 229080 threads, in which case many blocks will remain idle until an SM finishes with the 8 blocks it has.
So which is the preferred option? Does it make any difference for those blocks to remain idle waiting? Do they take resources?
Second question
Let's say that within the kernel I'm calling I need to access non coalesced global memory so an option is to use shared memory.
I can then use each thread to extract a value from an array on global memory say global_array which is of length 229080. Now as I understand correctly you have to avoid branching when copying to shared memory since all threads on a block need to reach the syncthreads() call to make sure they all can access the shared memory.
The problem here is that for the 229080 threads I need exactly 229080/256=894.84375 blocks because there is a residue of 216 threads. I can round up that number and get 895 blocks and the last block will just use 216 threads.
But since I need to extract the value to shared memory from global_array which is of length 229080 and I can't use a conditional statement to prevent the last 40 threads (256-216) from accessing illegal addresses on global_array then how can I circumvent this problem while working with shared memory loading?
So which is the preferred option? Does it make any difference for those blocks to remain idle waiting? Do they take resources?
A single kernel is preferred according to what you describe. Threadblocks queued up but not assigned to an SM don't take any resources you need to worry about, and the machine is definitely designed to handle situations just like that. The overhead of 8 kernel calls will definitely be slower, all other things being equal.
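A minimal sketch of the single-launch option (the kernel name myKernel and the pointer d_global_array are hypothetical): the block count is simply rounded up and everything is handed to one launch.
const int data_size         = 229080;
const int threads_per_block = 256;

// round up: (229080 + 255) / 256 = 895 blocks; the last block is only partially used
const int num_blocks = (data_size + threads_per_block - 1) / threads_per_block;

// one launch covers everything; blocks that cannot run immediately just wait
// in the queue until an SM frees up
myKernel<<<num_blocks, threads_per_block>>>(d_global_array, data_size);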
Now as I understand correctly you have to avoid branching when copying to shared memory since all threads on a block need to reach the syncthreads() call to make sure they all can access the shared memory.
This statement is not correct on the face of it. You can have branching while copying to shared memory. You just need to make sure that either:
The __syncthreads() is outside the branching construct, or,
The __syncthreads() is reached by all threads within the branching construct (which effectively means that the branch construct evaluates to the same path for all threads in the block, at least at the point where the __syncthreads() barrier is.)
Note that option 1 above is usually achievable, which makes code simpler to follow and easy to verify that all threads can reach the barrier.
But since I need to extract the value to shared memory from global_array which is of length 229080 and I can't use a conditional statement to prevent the last 40 threads (256-216) from accessing illegal addresses on global_array then how can I circumvent this problem while working with shared memory loading?
Do something like this:
int idx = threadIdx.x + (blockDim.x * blockIdx.x);
if (idx < data_size)
    shared[threadIdx.x] = global[idx];
__syncthreads();
This is perfectly legal. All threads in the block, whether they are participating in the data copy to shared memory or not, will reach the barrier.

Is changing memory within warp defined?

For now I use atomicAdd to change some memory cell. What I am interested in is whether the behaviour of changing the same memory (without atomicAdd) within a warp is defined. I have a particular architecture in mind: Fermi.
Let's say I have a pointer to memory, the same for all 32 threads (same block), there are no other threads at all, and I perform:
++(*ptr);
Is this undefined? Defined?
If ptr refers to the same global or shared memory location across threads in a warp, then the behavior is undefined. That is to say, the indicated contents (i.e. *ptr) will be undefined, when the operation is complete.
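A small sketch contrasting the two cases (the kernel name is illustrative; ptr is the shared pointer from the question):
__global__ void incrementCounter(int *ptr)
{
    // all 32 threads of the warp target the same location

    // ++(*ptr);        // undefined: the final contents of *ptr are unspecified

    atomicAdd(ptr, 1);  // defined: *ptr is incremented exactly once per thread
}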

Does __syncthreads() synchronize all threads in the grid?

...or just the threads in the current warp or block?
Also, when the threads in a particular block encounter (in the kernel) the following line
__shared__ float srdMem[128];
will they just declare this space once (per block)?
They all obviously operate asynchronously, so if Thread 23 in Block 22 is the first thread to reach this line, and Thread 69 in Block 22 is the last one to reach it, will Thread 69 know that it has already been declared?
The __syncthreads() command is a block-level synchronization barrier. That means it is safe to use when all threads in a block reach the barrier. It is also possible to use __syncthreads() in conditional code, but only when all threads evaluate such code identically, otherwise the execution is likely to hang or produce unintended side effects [4].
Example of using __syncthreads(): (source)
#define THREADS_PER_BLOCK 128   // assumed block size for this example

__global__ void globFunction(int *arr, int N)
{
    __shared__ int local_array[THREADS_PER_BLOCK]; // per-block shared memory cache
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int results = arr[idx];                        // ...calculate results (placeholder)
    local_array[threadIdx.x] = results;
    // synchronize the threads of this block before reading each other's writes
    __syncthreads();
    // read the result written by a neighbouring thread of the same block
    int val = local_array[(threadIdx.x + 1) % THREADS_PER_BLOCK];
    // write the value back to global memory
    arr[idx] = val;
}
To synchronize all threads in a grid there is currently no native API call. One way of synchronizing threads at grid level is using consecutive kernel calls, as at that point all threads end and start again from the same point. This is also commonly called CPU synchronization or implicit synchronization. Thus they are all synchronized.
Example of using this technique (source):
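(The example linked from the original answer is not reproduced here; the following is a minimal sketch with hypothetical kernel names and launch parameters, relying on the fact that two kernels issued to the same stream run one after the other.)
// phase 1: every block writes its partial result to d_data
kernelPhase1<<<num_blocks, threads_per_block>>>(d_data);

// the second launch does not start until every thread of the first kernel has
// finished (both are issued to the default stream), so all of d_data is visible
kernelPhase2<<<num_blocks, threads_per_block>>>(d_data);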
Regarding the second question. Yes, it does declare the amount of shared memory specified per block. Take into account that the quantity of available shared memory is measured per SM. So one should be very careful how the shared memory is used along with the launch configuration.
I agree with all the answers here, but I think we are missing one important point w.r.t. the first question. I am not answering the second question as it was answered perfectly in the above answers.
Execution on a GPU happens in units of warps. A warp is a group of 32 threads, and at any one time each thread of a particular warp executes the same instruction. If you allocate 128 threads in a block, that is 128/32 = 4 warps for the GPU.
Now the question becomes: "If all threads are executing the same instruction, then why is synchronization needed?" The answer is that we need to synchronize the warps that belong to the SAME block. __syncthreads does not synchronize threads in a warp, they are already synchronized. It synchronizes warps that belong to the same block.
That is why the answer to your question is: __syncthreads does not synchronize all threads in a grid, but only the threads belonging to one block, as each block executes independently.
If you want to synchronize a grid then divide your kernel (K) into two kernels(K1 and K2) and call both. They will be synchronized (K2 will be executed after K1 finishes).
__syncthreads() waits until all threads within the same block have reached the command; that means all warps that belong to a threadblock must reach the statement.
If you declare shared memory in a kernel, the array will only be visible to one threadblock. So each block will have its own shared memory block.
Existing answers have done a great job answering how __syncthreads() works (it allows intra-block synchronization); I just wanted to add an update that there are now newer methods for inter-block synchronization. Since CUDA 9.0, "Cooperative Groups" have been introduced, which allow synchronizing an entire grid of blocks (as explained in the CUDA Programming Guide). This achieves the same functionality as launching a new kernel (as mentioned above), but can usually do so with lower overhead and make your code more readable.
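A minimal sketch of that approach (the kernel name and data pointer are hypothetical; the device must support cooperative launch, the code must be built with -rdc=true, and the kernel must be started with cudaLaunchCooperativeKernel):
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void gridSyncKernel(float *data)
{
    cg::grid_group grid = cg::this_grid();

    // ... phase 1: every block writes its part of data ...

    grid.sync();   // barrier across ALL blocks of the grid, not just this block

    // ... phase 2: safe to read what other blocks wrote in phase 1 ...
}

// host side (sketch): the cooperative launch API is required for grid.sync()
// void *args[] = { &d_data };
// cudaLaunchCooperativeKernel((void *)gridSyncKernel, grid_dim, block_dim, args);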
In order to provide further details, aside from the other answers, quoting seibert:
More generally, __syncthreads() is a barrier primitive designed to protect you from read-after-write memory race conditions within a block.
The rules of use are pretty simple:
Put a __syncthreads() after the write and before the read when there is a possibility of a thread reading a memory location that another thread has written to.
__syncthreads() is only a barrier within a block, so it cannot protect you from read-after-write race conditions in global memory unless the only possible conflict is between threads in the same block. __syncthreads() is pretty much always used to protect shared memory read-after-write.
Do not use a __syncthreads() call in a branch or a loop until you are sure every single thread will reach the same __syncthreads() call. This can sometimes require that you break your if-blocks into several pieces to put __syncthreads() calls at the top level where all threads (including those which failed the if predicate) will execute them (see the sketch after this list).
When looking for read-after-write situations in loops, it helps to unroll the loop in your head when figuring out where to put __syncthreads() calls. For example, you often need an extra __syncthreads() call at the end of the loop if there are reads and writes from different threads to the same shared memory location in the loop.
__syncthreads() does not mark a critical section, so don’t use it like that.
Do not put a __syncthreads() at the end of a kernel call. There’s no need for it.
Many kernels do not need __syncthreads() at all because two different threads never access the same memory location.
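As a sketch of the "break the if-block into pieces" rule above (ACTIVE, compute(), use() and shared_buf are hypothetical placeholders):
// wrong: only some threads reach the barrier, so the block can hang
// if (threadIdx.x < ACTIVE) { shared_buf[threadIdx.x] = compute(); __syncthreads(); }

// right: split the if so that every thread reaches the same __syncthreads()
if (threadIdx.x < ACTIVE)
    shared_buf[threadIdx.x] = compute();   // only some threads write
__syncthreads();                           // but all threads hit the barrier
if (threadIdx.x < ACTIVE)
    use(shared_buf[threadIdx.x]);          // the writes are now visible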

Threads hierarchy design in kernel in CUDA

Assuming a block has limit of 512 threads, say my kernel needs more than 512 threads for execution, how should one design the thread hierarchy for optimal performance?
(case 1)
1st block - 512 threads
2nd block - remaining threads
(case 2) distribute equal number of threads across certain blocks.
I don't think that it really matters, but it is more important to group the thread blocks logically, so that you are able to use other CUDA optimizations (like memory coalescing).
This link provides some insight into how CUDA will (likely) organize your threads.
A quote from the summary:
To summarize, special parameters at a kernel launch define the dimensions of a grid and its blocks. Unique coordinates in blockId and threadId variables allow threads of a grid to distinguish among them. It is the programmer's responsibility to use these variables in the kernel functions so that the threads can properly identify the portion of the data to process. These variables compel the programmers to organize threads and their data into hierarchical and multi-dimensional organizations.
It is preferable to divide the threads equally into two blocks, in order to maximize the computation / memory access overlap. When you have, for instance, 256 threads in a block, they do not all compute at the same time; they are scheduled on the SM in warps of 32 threads. When a warp is waiting for global memory data, another warp is scheduled. If you have a small block of threads, your global memory accesses are a lot more penalizing.
Furthermore, in your example you underuse your GPU. Just remember that a GPU has dozens of multiprocessors (e.g. 30 for the C1060 Tesla), and a block is mapped to a multiprocessor. In your case, you will only use 2 multiprocessors.
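For example (the 600-thread total, the kernel name workKernel and the pointer d_data are hypothetical), the equal split of case 2 would be set up like this:
const int total_threads = 600;   // more than the 512-per-block limit
const int num_blocks    = 2;
const int block_size    = (total_threads + num_blocks - 1) / num_blocks;  // 300 threads per block

// case 2: two equally sized blocks of 300 threads instead of 512 + 88
workKernel<<<num_blocks, block_size>>>(d_data, total_threads);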