Kepler CUDA dynamic parallelism and thread divergence - cuda

There is very little information on Kepler's dynamic parallelism. From the description of this new technology, does it mean the issue of thread control-flow divergence within the same warp is solved?
It allows recursion and launching kernels from device code. Does that mean control paths in different threads can be executed simultaneously?

Take a look at this paper.
Dynamic parallelism, flow divergence and recursion are separate concepts. Dynamic parallelism is the ability to launch kernels from within a kernel. This means, for example, that you may do this:
__global__ void t_father(...) {
    ...
    t_child<<<BLOCKS, THREADS>>>();
    ...
}
I have personally investigated this area. When you do something like this and t_father launches t_child, the GPU's resources are redistributed among them, and t_father waits until all the t_child grids have finished before it can go on (see also Slide 25 of this paper).
Recursion has been available since Fermi and is the ability for a thread to call itself without any other thread/block re-configuration.
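As a minimal sketch (the function and kernel names are made up), device-side recursion simply means a __device__ function may call itself; no special launch configuration is involved:
__device__ int factorial(int n)
{
    // plain recursive call on the device; each thread uses its own call stack
    return (n <= 1) ? 1 : n * factorial(n - 1);
}

__global__ void recursion_demo(int *out)
{
    // each thread computes its own factorial independently
    out[threadIdx.x] = factorial(threadIdx.x + 1);
}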
Regarding flow divergence, I guess we will never see threads within a warp executing different code simultaneously.
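To illustrate (a hedged sketch with a made-up kernel), when threads of one warp take different branches the hardware serializes the two paths with inactive lanes masked off, rather than running them at the same time:
__global__ void diverge(int *out)
{
    int tid = threadIdx.x;
    if (tid % 2 == 0)
        out[tid] = tid * 2;   // even lanes execute while odd lanes sit masked off
    else
        out[tid] = tid * 3;   // then odd lanes execute while even lanes sit masked off
}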

No. The warp concept still exists. All the threads in a warp execute in SIMD (Single Instruction Multiple Data) fashion, which means that at any given moment they run one instruction. Even when you call a child kernel, the GPU designates one or more warps to your call. Keep 3 things in mind when you're using dynamic parallelism:
The deepest nesting level you can go is 24 (CC = 3.5).
The number of dynamic kernels running at the same time is limited (default 4096) but can be increased (see the sketch after this list).
Keep the parent kernel busy after the child kernel call, otherwise there is a good chance you will waste resources.
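As a hedged host-side sketch of how those limits can be adjusted (the values below are arbitrary), the device-runtime limits are queried and raised with cudaDeviceGetLimit / cudaDeviceSetLimit before launching the parent kernel:
#include <cstdio>

int main()
{
    size_t depth = 0, pending = 0;

    // Raise the maximum nesting depth for device-side synchronization and the
    // buffer for outstanding child-kernel launches (arbitrary example values).
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 8);
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 8192);

    cudaDeviceGetLimit(&depth,   cudaLimitDevRuntimeSyncDepth);
    cudaDeviceGetLimit(&pending, cudaLimitDevRuntimePendingLaunchCount);
    printf("sync depth = %zu, pending launches = %zu\n", depth, pending);
    return 0;
}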

There's sample CUDA source in this NVIDIA presentation on slide 9.
__global__ void convolution(int x[])
{
    for (int j = 1; j <= x[blockIdx.x]; j++)
        kernel<<< ... >>>(blockIdx.x, j);
}
It goes on to show how part of the CUDA control code is moved to the GPU, so that the kernel can spawn other kernel functions on partial compute domains of various sizes (slide 14).
The global compute domain and the partitioning of it are still static, so you can't actually go and change this DURING GPU computation to e.g. spawn more kernel executions because you've not reached the end of your evaluation function yet. Instead, you provide an array that holds the number of threads you want to spawn with a specific kernel.

Related

What about the register resource situation when all threads quit (return) except one?

I'm writing a CUDA program with the dynamic parallelism mechanism, just like this:
{
    if (tid != 0) return;
    else {
        anotherKernel<<<gridDim, blockDim>>>();
    }
}
I know the parent kernel will not quit until the child kernel finishes its work. Does that mean the register resources of the other threads in the parent kernel (except tid == 0) will not be reclaimed? Can anyone help me?
When and how a terminated thread's resources (e.g. register use) are returned to the machine for use by other blocks is unspecified, and empirically seems to vary by GPU architecture. The reasonable candidates here are that resources are returned at completion of the block, or at completion of the warp.
But that uncertainty need not go beyond the block level. A block that is fully retired returns its resources to the SM that it was resident on for future scheduling purposes. It does not wait for the completion of the kernel. This characteristic is self-evident(*) as being a necessity for the proper operation of a CUDA GPU.
Therefore for the example you have given, we can be sure that all threadblocks except the first threadblock will release their resources, at the point of the return statement. I cannot make specific claims about when exactly warps in the first threadblock may release their resources (except that when thread 0 terminates, resources will be released at that point, if not before).
(*) If it were not the case, a GPU would not be able to process a kernel with more than a relatively small number of blocks (e.g. for the latest GPUs, on the order of several thousand blocks.) Yet it is easy to demonstrate that even the smallest GPUs can process kernels with millions of blocks.

Can a kernel change its block size?

The title can't hold the whole question: I have a kernel doing a stream compaction, after which it continues using a smaller number of threads.
I know one way to avoid executing the unused threads: returning and executing a second kernel with a smaller block size.
What I'm asking is, provided unused threads diverge and end (return), and provided they align in complete warps, can I safely assume they won't waste execution?
Is there a common practice for this, other than splitting in two consecutive kernel execution?
Thank you very much!
The unit of execution scheduling and resource scheduling within the SM is the warp - groups of 32 threads.
It is perfectly legal to retire threads in any order using return within your kernel code. However there are at least 2 considerations:
The usage of __syncthreads() in device code depends on having every thread in the block participating. So if a thread hits a return statement, that thread could not possibly participate in a future __syncthreads() statement, and so usage of __syncthreads() after one or more threads have retired is illegal.
From an execution efficiency standpoint (and also from a resource scheduling standpoint, although this latter concept is not well documented and somewhat involved to prove), a warp will still consume execution (and other) resources, until all threads in the warp have retired.
If you can retire your threads in warp units, and don't require the usage of __syncthreads() you should be able to make fairly efficient usage of the GPU resources even in a threadblock that retires some warps.
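As a hedged sketch of that warp-unit retirement idea (the kernel name, data layout and post-compaction count are assumptions), whole warps whose indices lie beyond the compacted element count return early while the surviving warps keep working; note that no __syncthreads() may follow the early return:
__global__ void process_compacted(const float *data, float *out, int count)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    // If the entire warp lies past the live range, retire it as a unit so it
    // stops consuming execution resources. No __syncthreads() allowed after this.
    int warp_first = (tid / 32) * 32;
    if (warp_first >= count) return;

    if (tid < count)                  // the boundary warp still needs a per-thread guard
        out[tid] = data[tid] * 2.0f;  // placeholder per-element work
}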
For completeness, a threadblock's dimensions are defined at kernel launch time, and they cannot and do not change at any point thereafter. All threadblocks have threads that eventually retire. The concept of retiring threads does not change a threadblock's dimensions, in my usage here (and consistent with usage of __syncthreads()).
Although probably not related to your question directly, CUDA Dynamic Parallelism could be another methodology to allow a threadblock to "manage" dynamically varying execution resources. However for a given threadblock itself, all of the above comments apply in the CDP case as well.

Persistent threads in OpenCL and CUDA

I have read some papers talking about "persistent threads" for GPGPU, but I don't really understand it. Can anyone give me an example or show me the use of this programming style?
What I keep in mind after reading and googling "persistent threads":
Persistent threads are nothing more than a while loop that keeps threads running and computing lots of chunks of work.
Is this correct? Thanks in advance
Reference: http://www.idav.ucdavis.edu/publications/print_pub?pub_id=1089
http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0157-GTC2012-Persistent-Threads-Computing.pdf
CUDA exploits the Single Instruction Multiple Data (SIMD) programming model. The computational threads are organized in blocks and the thread blocks are assigned to the different Streaming Multiprocessors (SMs). The execution of a thread block on an SM is performed by arranging the threads in warps of 32 threads: each warp operates in lock-step and executes exactly the same instruction on different data.
Generally, to fill up the GPU, the kernel is launched with many more blocks than can actually be hosted on the SMs. Since not all the blocks can be hosted on an SM at once, a work scheduler performs a context switch when a block has finished computing. It should be noticed that the switching of the blocks is managed entirely in hardware by the scheduler, and the programmer has no means of influencing how blocks are scheduled onto the SMs. This exposes a limitation for all those algorithms that do not perfectly fit a SIMD programming model and for which there is work imbalance. Indeed, a block A will not be replaced by another block B on the same SM until the last thread of block A has finished executing.
Although CUDA does not expose the hardware scheduler to the programmer, the persistent threads style bypasses the hardware scheduler by relying on a work queue. When a block finishes, it checks the queue for more work and continues doing so until no work is left, at which point the block retires. In this way, the kernel is launched with as many blocks as the number of available SMs.
The persistent threads technique is better illustrated by the following example, which has been taken from the presentation
“GPGPU” computing and the CUDA/OpenCL Programming Model
Another more detailed example is available in the paper
Understanding the efficiency of ray traversal on GPUs
// Persistent thread: run until the work is done, processing multiple work items
// per thread rather than just one. Terminates when no more work is available.
// count represents the number of data items to be processed.
__global__ void persistent(int* ahead, int* bhead, int count, float* a, float* b)
{
    int local_input_data_index, local_output_data_index;
    while ((local_input_data_index = read_and_increment(ahead)) < count)
    {
        load_locally(a[local_input_data_index]);
        do_work_with_locally_loaded_data();
        local_output_data_index = read_and_increment(bhead);
        write_result(b[local_output_data_index]);
    }
}
// Launch exactly enough threads to fill up the machine (to achieve sufficient
// parallelism and latency hiding)
persistent<<<numBlocks, blockSize>>>(ahead_addr, bhead_addr, total_count, A, B);
Quite easy to understand. Usually each work item processes a small amount of work. If you want to save work-group switch time, you can let one work item process a lot of work using a loop. For instance, if you have a 1920x1080 image, you can launch 1920 work items and have each work item process one column of 1080 pixels in a loop.
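A minimal CUDA sketch of that image example (the kernel name, row-major layout and the halving operation are assumptions): 1920 threads are launched, one per column, and each thread loops over the 1080 rows of its column instead of launching one thread per pixel.
__global__ void process_columns(const float *in, float *out, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= width) return;             // retire threads beyond the image width

    for (int row = 0; row < height; ++row) {
        int idx = row * width + col;      // row-major pixel index
        out[idx] = in[idx] * 0.5f;        // placeholder per-pixel work
    }
}
// e.g. process_columns<<<(1920 + 255) / 256, 256>>>(d_in, d_out, 1920, 1080);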

When to call cudaDeviceSynchronize?

When is calling the cudaDeviceSynchronize function really needed?
As far as I understand from the CUDA documentation, CUDA kernels are asynchronous, so it seems that we should call cudaDeviceSynchronize after each kernel launch. However, I have tried the same code (training neural networks) with and without any cudaDeviceSynchronize, except one before the time measurement. I have found that I get the same result but with a speed-up of between 7x and 12x (depending on the matrix sizes).
So, the question is whether there are any reasons to use cudaDeviceSynchronize apart from time measurement.
For example:
Is it needed before copying data from the GPU back to the host with cudaMemcpy?
If I do matrix multiplications like
C = A * B
D = C * F
should I put cudaDeviceSynchronize between both?
From my experiments, it seems that I don't.
Why does cudaDeviceSynchronize slow the program so much?
Although CUDA kernel launches are asynchronous, all GPU-related tasks placed in one stream (which is the default behavior) are executed sequentially.
So, for example,
kernel1<<<X,Y>>>(...); // kernel starts execution, CPU continues to next statement
kernel2<<<X,Y>>>(...); // kernel is placed in queue and will start after kernel1 finishes, CPU continues to next statement
cudaMemcpy(...); // CPU blocks until memory is copied, memory copy starts only after kernel2 finishes
So in your example, there is no need for cudaDeviceSynchronize. However, it might be useful for debugging to detect which of your kernels has caused an error (if there is any).
cudaDeviceSynchronize may cause some slowdown, but 7-12x seems too much. Maybe there is some problem with time measurement, or maybe the kernels are really fast and the overhead of explicit synchronization is huge relative to the actual computation time.
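A hedged, self-contained sketch of the points above (the kernel names and sizes are made up): both launches are queued on the default stream, cudaDeviceSynchronize() is used only to surface a possible kernel error, and the final cudaMemcpy would block correctly even without it.
#include <cstdio>

__global__ void kernel1(int *out) { out[threadIdx.x] = threadIdx.x; }
__global__ void kernel2(int *out) { out[threadIdx.x] *= 2; }

int main()
{
    int *d_out = nullptr;
    cudaMalloc(&d_out, 32 * sizeof(int));

    kernel1<<<1, 32>>>(d_out);   // asynchronous: the CPU continues immediately
    kernel2<<<1, 32>>>(d_out);   // queued behind kernel1 in the same (default) stream

    // Debugging aid only: block until both kernels finish and report any error.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));

    int h_out[32];
    // cudaMemcpy blocks until the copy (and the preceding kernels) are done,
    // so no extra synchronization is required for correctness here.
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("h_out[5] = %d\n", h_out[5]);

    cudaFree(d_out);
    return 0;
}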
One situation where using cudaDeviceSynchronize() is appropriate would be when you have several cudaStreams running, and you would like to have them exchange some information. A real-life case of this is parallel tempering in quantum Monte Carlo simulations. In this case, we would want to ensure that every stream has finished running some set of instructions and gotten some results before they start passing messages to each other, or we would end up passing garbage information. The reason using this command slows the program so much is that cudaDeviceSynchronize() forces the program to wait for all previously issued commands in all streams on the device to finish before continuing (from the CUDA C Programming Guide). As you said, kernel execution is normally asynchronous, so while the GPU device is executing your kernel the CPU can continue to work on some other commands, issue more instructions to the device, etc., instead of waiting. However, when you use this synchronization command, the CPU is instead forced to idle until all the GPU work has completed before doing anything else. This behaviour is useful when debugging, since you may have a segfault occurring at seemingly "random" times because of the asynchronous execution of device code (whether in one stream or many). cudaDeviceSynchronize() will force the program to ensure the streams' kernels/memcpys are complete before continuing, which can make it easier to find out where the illegal accesses are occurring (since the failure will show up during the sync).
When you want your GPU to start processing some data, you typically do a kernel invocation.
When you do so, your device (the GPU) will start doing whatever it is you told it to do. However, unlike in a normal sequential program, your host (the CPU) will continue to execute the next lines of code in your program. cudaDeviceSynchronize makes the host (the CPU) wait until the device (the GPU) has finished executing ALL the threads you have started, and thus your program will continue as if it were a normal sequential program.
In small, simple programs you would typically use cudaDeviceSynchronize when you use the GPU to make computations, to avoid timing mismatches between the CPU requesting the result and the GPU finishing the computation. Using cudaDeviceSynchronize makes it a lot easier to code your program, but there is one major drawback: your CPU is idle all the time while the GPU makes the computation. Therefore, in high-performance computing, you often strive towards having your CPU make computations while it waits for the GPU to finish.
You might also need to call cudaDeviceSynchronize() after launching kernels from kernels (Dynamic Parallelism).
From this post CUDA Dynamic Parallelism API and Principles:
If the parent kernel needs results computed by the child kernel to do its own work, it must ensure that the child grid has finished execution before continuing by explicitly synchronizing using cudaDeviceSynchronize(void). This function waits for completion of all grids previously launched by the thread block from which it has been called. Because of nesting, it also ensures that any descendants of grids launched by the thread block have completed.
...
Note that the view of global memory is not consistent when the kernel launch construct is executed. That means that in the following code example, it is not defined whether the child kernel reads and prints the value 1 or 2. To avoid race conditions, memory which can be read by the child should not be written by the parent after kernel launch but before explicit synchronization.
__device__ int v = 0;

__global__ void child_k(void) {
    printf("v = %d\n", v);
}

__global__ void parent_k(void) {
    v = 1;
    child_k<<<1, 1>>>();
    v = 2; // RACE CONDITION
    cudaDeviceSynchronize();
}

Parallelism in GPU - CUDA / OpenCL

I have a general question about parallelism in CUDA or OpenCL code on the GPU. I use an NVIDIA GTX 470.
I read briefly in the CUDA programming guide, but did not find related answers, hence asking here.
I have a top-level function which calls the CUDA kernel (for the same kernel I have an OpenCL version of it). This top-level function itself is called 3 times in a 'for loop' from my main function, for 3 different data sets (image data R, G, B),
and the actual codelet also has processing over all the pixels in the image/frame, so it has 2 'for loops'.
What I want to know is what kind of parallelism is exploited here - task level parallelism or data parallelism?
So what I want to understand is: does this CUDA and C code create multiple threads for different functionality/functions in the codelet and top-level code and execute them in parallel, thus exploiting task parallelism? If yes, who creates them, as there is no threading library explicitly included in the code or linked with it.
OR
Does it create threads/tasks for different 'for loop' iterations which are independent, thus achieving data parallelism?
If it does this kind of parallelism, does it exploit it just by noting that the different for-loop iterations have no dependencies and hence can be scheduled in parallel?
Because I don't see any special compiler constructs/intrinsics (like parallel for loops in OpenMP) which tell the compiler/scheduler to schedule such for loops / functions in parallel.
Any reading material would help.
Parallelism on GPUs is SIMT (Single Instruction Multiple Threads). For CUDA kernels, you specify a grid of blocks where every block has N threads. The CUDA library handles all the details, and the CUDA compiler (nvcc) generates the GPU code which is executed by the GPU. The CUDA library tells the GPU driver, and furthermore the thread scheduler on the GPU, how many threads should execute the kernel ((number of blocks) x (number of threads)). In your example the top-level function (or host function) executes only the kernel call, which is asynchronous and returns immediately. No threading library is needed because nvcc creates the calls to the driver.
A sample kernel call looks like this:
helloworld<<<BLOCKS, THREADS>>>(/* maybe some parameters */);
OpenCL follows the same paradigm, but you compile your kernels (if they are not precompiled) at runtime. You specify the number of threads to execute the kernel and the library does the rest.
The best way to learn CUDA (OpenCL) is to look in the CUDA Programming Guide (OpenCL Programming Guide) and look at the samples in the GPU Computing SDK.
What I want to know is what kind of parallelism is exploited here - task level parallelism or data parallelism?
Predominantly data parallelism, but there's also some task parallelism involved.
In your image processing example a kernel might do the processing for a single output pixel. You'd instruct OpenCL or CUDA to run as many threads as there are pixels in the output image. It then schedules those threads to run on the GPU/CPU that you're targeting.
Highly data parallel. The kernel is written to do a single work item, and you schedule millions of them.
The task parallelism comes in because your host program is still running on the CPU whilst the GPU is running all those threads, so it can be getting on with other work. Often this is preparing data for the next set of kernel threads, but it could be a completely separate task.
If you launch multiple kernels, they will not be automatically be parallelized (i.e. no GPU task parallelism). However, the kernel invocation is asynchronous on the host side, so host code will continue running in parallel while the kernel is executing.
To get task parallelism you have to do it by hand - in Cuda the concept is called streams, and in OpenCL command queues. Without explicitly creating multiple streams/queues and scheduling each kernel to its own queue, they will be executed in sequence (there is an OpenCL feature allowing queues to run out-of-order, but I don't know if any implementation supports it.) However, running the kernels in parallel will probably not give much benefit if each dataset is large enough to utilize all the GPU cores.
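A hedged CUDA sketch of that idea (the kernels, sizes and data pointers below are placeholders I made up): two independent kernels are put in separate streams so the hardware may overlap them if resources allow.
__global__ void kernelA(float *d) { d[threadIdx.x] += 1.0f; }  // placeholder work
__global__ void kernelB(float *d) { d[threadIdx.x] *= 2.0f; }  // placeholder work

void launch_in_parallel(float *d_dataA, float *d_dataB)
{
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Each kernel goes to its own stream; launched in the same stream they would run in sequence.
    kernelA<<<1, 256, 0, s0>>>(d_dataA);
    kernelB<<<1, 256, 0, s1>>>(d_dataB);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}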
If you have actual for loops in your kernels, they will not in themselves be parallelized, the parallelism comes from specifying a grid size, which will cause the kernel to be invoked in parallel for each element in that grid (so if you have for loops inside your kernel they will be executed in full by each thread). In other words, you should specify a grid size when calling the kernel, and inside the kernel use threadIdx/blockIdx (Cuda) or getGlobalId() (OpenCL) to identify which data item to process in that particular thread.
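For completeness, a minimal CUDA sketch of that indexing pattern (the kernel name and the doubling operation are made up): the grid supplies the parallelism and each thread uses blockIdx/threadIdx to pick its own element; any for loop written inside the kernel body would still run in full on every thread.
__global__ void scale(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global data index for this thread
    if (i < n)
        out[i] = in[i] * 2.0f;                      // placeholder per-element work
}

// Host side: one thread per element, grid size rounded up.
// scale<<<(n + 255) / 256, 256>>>(d_in, d_out, n);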
A useful book for learning OpenCL is the OpenCL Programming Guide, but the OpenCL spec is also worth a look.