Persistent threads in OpenCL and CUDA - cuda

I have read some papers talking about "persistent threads" for GPGPU, but I don't really understand the concept. Can anyone give me an example, or show how this programming style is used?
What I have kept in mind after reading and googling "persistent threads":
Persistent threads are nothing more than a while loop that keeps a thread running and processing many units of work.
Is this correct? Thanks in advance.
Reference: http://www.idav.ucdavis.edu/publications/print_pub?pub_id=1089
http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0157-GTC2012-Persistent-Threads-Computing.pdf

CUDA exploits the Single Instruction Multiple Data (SIMD) programming model. The computational threads are organized in blocks, and the thread blocks are assigned to different Streaming Multiprocessors (SMs). A thread block executes on an SM by being arranged in warps of 32 threads: each warp operates in lock-step and executes exactly the same instruction on different data.
Generally, to fill up the GPU, the kernel is launched with many more blocks than can actually be resident on the SMs. Since not all the blocks can be resident on an SM at once, a work scheduler performs a context switch when a block has finished computing. Note that the switching of blocks is managed entirely in hardware by the scheduler, and the programmer has no means of influencing how blocks are scheduled onto the SMs. This is a limitation for all those algorithms that do not perfectly fit the SIMD programming model and whose workload is imbalanced: a block A will not be replaced by another block B on the same SM until the last thread of block A has finished executing.
Although CUDA does not expose the hardware scheduler to the programmer, the persistent threads style bypasses the hardware scheduler by relying on a work queue. When a block finishes, it checks the queue for more work and continues doing so until no work is left, at which point the block retires. In this way, the kernel is launched with as many blocks as the number of available SMs.
The persistent threads technique is better illustrated by the following example, which has been taken from the presentation
“GPGPU” computing and the CUDA/OpenCL Programming Model
Another more detailed example is available in the paper
Understanding the efficiency of ray traversal on GPUs
// Persistent threads: run until all work is done, processing multiple work
// items per thread rather than just one. Terminates when no more work is
// available. count is the total number of work items to be processed.
// read_and_increment() atomically fetches the next index from a global queue
// head; a straightforward implementation is atomicAdd(head, 1).
__global__ void persistent(int* ahead, int* bhead, int count, float* a, float* b)
{
    int local_input_data_index, local_output_data_index;
    while ((local_input_data_index = read_and_increment(ahead)) < count)
    {
        load_locally(a[local_input_data_index]);
        do_work_with_locally_loaded_data();
        local_output_data_index = read_and_increment(bhead);
        write_result(b[local_output_data_index]);
    }
}
// Launch exactly enough threads to fill up the machine (to achieve sufficient
// parallelism and latency hiding)
persistent<<<numBlocks, blockSize>>>(ahead_addr, bhead_addr, total_count, A, B);
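A hedged sketch of how `numBlocks` might be chosen on the host side so that the grid exactly fills the GPU (the names `numBlocks`, `blockSize` and the `persistent` kernel come from the snippet above; occupancy-based sizing is one common approach, not the only one, and error checking is omitted):

```cuda
// Size a persistent-threads launch: as many blocks as can be simultaneously
// resident across all SMs, and no more.
int device, numSMs, blocksPerSM;
cudaGetDevice(&device);
cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);

int blockSize = 128;  // example choice; tune for your kernel
// How many blocks of this size can be resident on one SM at once
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, persistent,
                                              blockSize, /*dynamicSmem=*/0);

int numBlocks = numSMs * blocksPerSM;  // fills the GPU without oversubscribing
persistent<<<numBlocks, blockSize>>>(ahead_addr, bhead_addr, total_count, A, B);
```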

Quite easy to understand. Usually each work item processes a small amount of work. If you want to save work-group switch time, you can let one work item process a lot of work using a loop. For instance, if you have a 1920x1080 image, you can use 1920 work items, each of which processes one column of 1080 pixels in a loop.
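A minimal CUDA sketch of that per-column idea (the kernel name, the doubling operation and the launch configuration are placeholders for illustration):

```cuda
// Each thread owns one column of the image and loops over all its rows,
// so one thread processes many work items instead of just one pixel.
__global__ void process_columns(float* img, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= width) return;                 // guard for partial blocks

    for (int row = 0; row < height; ++row) {  // loop over the column
        int idx = row * width + col;
        img[idx] = img[idx] * 2.0f;           // placeholder per-pixel operation
    }
}

// Launch: one thread per column, e.g. for a 1920x1080 image:
// process_columns<<<(1920 + 255) / 256, 256>>>(d_img, 1920, 1080);
```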

Related

How about the register resource situation when all threads quit (return) except one?

I'm writing a CUDA program with the dynamic parallelism mechanism, just like this:
{
    if (tid != 0) return;
    else {
        anotherKernel<<<gridDim, blockDim>>>();
    }
}
I know the parent kernel will not quit until the child kernel finishes its work. Does that mean the register resources of the other threads in the parent kernel (those with tid != 0) will not be reclaimed? Can anyone help me?
When and how a terminated thread's resources (e.g. register use) are returned to the machine for use by other blocks is unspecified, and empirically seems to vary by GPU architecture. The reasonable candidates here are that resources are returned at completion of the block, or at completion of the warp.
But that uncertainty need not go beyond the block level. A block that is fully retired returns its resources to the SM that it was resident on for future scheduling purposes. It does not wait for the completion of the kernel. This characteristic is self-evident(*) as being a necessity for the proper operation of a CUDA GPU.
Therefore for the example you have given, we can be sure that all threadblocks except the first threadblock will release their resources, at the point of the return statement. I cannot make specific claims about when exactly warps in the first threadblock may release their resources (except that when thread 0 terminates, resources will be released at that point, if not before).
(*) If it were not the case, a GPU would not be able to process a kernel with more than a relatively small number of blocks (e.g. for the latest GPUs, on the order of several thousand blocks.) Yet it is easy to demonstrate that even the smallest GPUs can process kernels with millions of blocks.

Can kernel change its block size?

The title can't hold the whole question: I have a kernel doing a stream compaction, after which it continues using a smaller number of threads.
I know one way to avoid executing the unused threads: returning, and then executing a second kernel with a smaller block size.
What I'm asking is: provided the unused threads diverge and end (return), and provided they align in complete warps, can I safely assume they won't waste execution resources?
Is there a common practice for this, other than splitting into two consecutive kernel executions?
Thank you very much!
The unit of execution scheduling and resource scheduling within the SM is the warp - groups of 32 threads.
It is perfectly legal to retire threads in any order using return within your kernel code. However there are at least 2 considerations:
The usage of __syncthreads() in device code depends on having every thread in the block participating. So if a thread hits a return statement, that thread could not possibly participate in a future __syncthreads() statement, and so usage of __syncthreads() after one or more threads have retired is illegal.
From an execution efficiency standpoint (and also from a resource scheduling standpoint, although this latter concept is not well documented and somewhat involved to prove), a warp will still consume execution (and other) resources, until all threads in the warp have retired.
If you can retire your threads in warp units, and don't require the usage of __syncthreads() you should be able to make fairly efficient usage of the GPU resources even in a threadblock that retires some warps.
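A hedged sketch of retiring threads in whole-warp units after compaction (`activeCount`, the kernel name and the per-thread work are placeholders; assumes a warp size of 32):

```cuda
__global__ void compacted_work(float* data, int activeCount)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int warpStart = (tid / 32) * 32;   // index of this warp's first thread

    // If the entire warp lies beyond the active range, retire it: once all
    // 32 lanes have returned, the warp stops consuming execution resources.
    if (warpStart >= activeCount) return;

    // A partially active warp still occupies an execution slot; its inactive
    // lanes simply do no work.
    if (tid < activeCount) {
        data[tid] *= 2.0f;             // placeholder per-thread work
    }
    // Note: no __syncthreads() may appear after the early return above,
    // since some threads of the block have already retired.
}
```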
For completeness, a threadblock's dimensions are defined at kernel launch time, and they cannot and do not change at any point thereafter. All threadblocks have threads that eventually retire. The concept of retiring threads does not change a threadblock's dimensions, in my usage here (and consistent with usage of __syncthreads()).
Although probably not related to your question directly, CUDA Dynamic Parallelism could be another methodology to allow a threadblock to "manage" dynamically varying execution resources. However for a given threadblock itself, all of the above comments apply in the CDP case as well.

Kepler CUDA dynamic parallelism and thread divergence

There is very little information on the dynamic parallelism of Kepler. From the description of this new technology, does it mean the issue of thread control-flow divergence within the same warp is solved?
It allows recursion and launching kernels from device code. Does that mean control paths in different threads can be executed simultaneously?
Take a look at this paper
Dynamic parallelism, flow divergence and recursion are separate concepts. Dynamic parallelism is the ability to launch new kernels from within a kernel. This means, for example, you may do this
__global__ void t_father(...) {
    ...
    t_child<<<BLOCKS, THREADS>>>();
    ...
}
I personally investigated this area. When you do something like this, i.e. when t_father launches t_child, the whole GPU's resources are distributed again among them, and t_father waits until all the t_child launches have finished before it can go on (see also slide 25 of this paper).
Recursion has been available since Fermi and is the ability for a thread to call itself without any other thread/block re-configuration.
Regarding flow divergence, I guess we will never see threads within a warp executing different code simultaneously.
No. The warp concept still exists. All the threads in a warp are SIMD (Single Instruction Multiple Data), which means that at any given time they run one instruction. Even when you call a child kernel, the GPU designates one or more warps to your call. Keep three things in mind when you're using dynamic parallelism:
The deepest you can go is 24 levels of nesting (CC = 3.5).
The number of dynamic kernels pending at the same time is limited (default 4096) but can be increased.
Keep the parent kernel busy after the child kernel call, otherwise there is a good chance you will waste resources.
There's a sample cuda source in this NVidia presentation on slide 9.
__global__ void convolution(int x[])
{
    for (int j = 1; j <= x[blockIdx.x]; j++)
        kernel<<< ... >>>(blockIdx.x, j);
}
It goes on to show how part of the CUDA control code is moved to the GPU, so that the kernel can spawn other kernel functions on partial compute domains of various sizes (slide 14).
The global compute domain and the partitioning of it are still static, so you can't actually go and change this DURING GPU computation to e.g. spawn more kernel executions because you've not reached the end of your evaluation function yet. Instead, you provide an array that holds the number of threads you want to spawn with a specific kernel.

Forcing a CUDA thread block to yield

This question is related to: Does Nvidia Cuda warp Scheduler yield?
However, my question is about forcing a thread block to yield by doing some controlled memory operation (which is heavy enough to make the thread block yield). The idea is to allow another ready-state thread block to execute on the now vacant multiprocessor.
The PTX manual v2.3 mentions (section 6.6):
...Much of the delay to memory can be hidden in a number of ways. The first is to have multiple threads of execution so that the hardware can issue a memory operation and then switch to other execution. Another way to hide latency is to issue the load instructions as early as possible, as execution is not blocked until the desired result is used in a subsequent (in time) instruction...
So it sounds like this can be achieved (despite being an ugly hack). Has anyone tried something similar? Perhaps with block_size = warp_size kind of setting?
EDIT: I've raised this question without clearly understanding the difference between resident and non-resident (but assigned to the same SM) thread blocks. So, the question should be about switching between two resident (warp-sized) thread blocks. Apologies!
In the CUDA programming model as it stands today, once a thread block starts running on a multiprocessor, it runs to completion, occupying resources until it completes. There is no way for a thread block to yield its resources other than returning from the global function that it is executing.
Multiprocessors will switch among warps of all resident thread blocks automatically, so thread blocks can "yield" to other resident thread blocks. But a thread block can't yield to a non-resident thread block without exiting -- which means it can't resume.
Starting from compute capability 7.0 (Volta), you have the __nanosleep() function, which will put a thread to sleep for approximately a given nanosecond duration.
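A minimal sketch of a polling loop that backs off with __nanosleep() (the flag variable and the 1 µs duration are illustrative; the actual sleep length is approximate and not guaranteed):

```cuda
// Spin on a host-set flag, sleeping between checks so the warp yields its
// execution slots instead of busy-waiting. Requires compute capability >= 7.0.
__global__ void wait_for_flag(volatile int* flag)
{
    while (*flag == 0) {
        __nanosleep(1000);   // sleep roughly 1 microsecond before re-checking
    }
    // ... proceed once the flag has been set ...
}
```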
Another option (available since compute capability 3.5) is to start a grid with lower priority using the cudaStreamCreateWithPriority() call. This allows you to run one stream at lower priority. Do note that (some) GPUs only have 2 priorities, meaning that you may have to run your main code at high priority in order to be able to dodge the default priority.
Here's a code snippet:
// get the range of stream priorities for this device
int priority_high, priority_low;
cudaDeviceGetStreamPriorityRange(&priority_low, &priority_high);
// create streams with highest and lowest available priorities
cudaStream_t st_high, st_low;
cudaStreamCreateWithPriority(&st_high, cudaStreamNonBlocking, priority_high);
cudaStreamCreateWithPriority(&st_low, cudaStreamNonBlocking, priority_low);

how many processors can I get in a block on cuda GPU?

I have three questions to ask
If I create only one block of threads in CUDA and execute the parallel program on it, is it possible that more than one processor would be given to the single block, so that my program gets some benefit from the multiprocessor platform? To be more clear: if I use only one block of threads, how many processors will be allocated to it? Because, as far as I know (I might have misunderstood it), one warp is given only a single processing element.
Can I synchronize the threads of different blocks? If yes, please give some hints on how to do it.
How do I find out the warp size? Is it fixed for a particular hardware?
1 is it possible that more than one processors would be given to single block so that my program get some benefit of multiprocessor platform
Simple answer: No.
The CUDA programming model maps one threadblock to one multiprocessor (SM); the block cannot be split across two or more multiprocessors and, once started, it will not move from one multiprocessor to another.
As you have seen, CUDA provides __syncthreads() to allow threads within a block to synchronise. This is a very low cost operation, and that's partly because all the threads within a block are in close proximity (on the same SM). If they were allowed to split then this would no longer be possible. In addition, threads within a block can cooperate by sharing data in the shared memory; the shared memory is local to a SM and hence splitting the block would break this too.
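A minimal sketch of the kind of intra-block cooperation that ties a block to a single SM: threads share data through on-SM shared memory and synchronise with __syncthreads() (the kernel name and the fixed block size of 256, assumed to be a power of two, are illustrative assumptions):

```cuda
// Sum 256 values per block using shared memory, which is physically local
// to the SM the block is resident on.
__global__ void block_sum(const float* in, float* out)
{
    __shared__ float buf[256];            // one buffer per block, on-SM storage
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                      // every thread of the block must reach this

    // Tree reduction within the block
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = buf[0];
}

// Launch with exactly 256 threads per block, e.g.:
// block_sum<<<numBlocks, 256>>>(d_in, d_out);
```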
2 can I synchronize the threads of different blocks ?
Not really, no. There are some things you can do, like having the very last block do something special (see the threadFenceReduction sample in the SDK), but general synchronisation is not really possible. When you launch a grid, you have no control over the scheduling of the blocks onto the multiprocessors, so any attempt at global synchronisation would risk deadlock.
3 How to find out warp size ? it is fixed for a particular hardware ?
Yes, it is fixed. In fact, for all current CUDA-capable devices (both 1.x and 2.0) it is fixed to 32. If you are relying on the warp size then you should ensure forward compatibility by checking it at runtime.
In device code you can just use the special variable warpSize. In host code you can query the warp size for a specific device with:
cudaError_t result;
int deviceID;
struct cudaDeviceProp prop;
result = cudaGetDevice(&deviceID);
if (result != cudaSuccess)
{
    ...
}
result = cudaGetDeviceProperties(&prop, deviceID);
if (result != cudaSuccess)
{
    ...
}
int warpSize = prop.warpSize;
As of CUDA 2.3: one multiprocessor per thread block. It might be different on CUDA 3 / Fermi processors, I do not remember.
Not really, but... (depending on your requirements you may find a workaround)
Read this post: CUDA: synchronizing threads
#3. You can query SIMDWidth using cuDeviceGetProperties - see doc
To synchronize threads across multiple blocks (at least as far as memory updates are concerned), you can use the new __threadfence_system() call, which is only available on Fermi devices (Compute Capability 2.0 and better). This function is described in the CUDA Programming guide for CUDA 3.0.
Can I synchronize threads of different blocks with the following approach? Please tell me if there is any problem with this approach (I think there will be some, but since I'm not very experienced in CUDA, I might not have considered some facts):
__global__ void sync_func(int *glob_var){
    int i = 0; // local to each thread: has it incremented the counter yet?
    int total_threads = gridDim.x * blockDim.x;
    while (*glob_var != total_threads) {
        if (i == 0) {
            atomicAdd(glob_var, 1);
            i = 1;
        }
    }
    // execute the code which is to be executed at the same time by all threads
}