Multiple Child Kernels running in Parallel in CUDA

__global__ void ChildKernel1(void* data){
    // Operate on data
}
__global__ void ChildKernel2(void* data){
    // Operate on data
}
__global__ void ChildKernel3(void* data){
    // Operate on data
}
__global__ void ChildKernel4(void* data){
    // Operate on data
}
__global__ void ParentKernel(void *data){
    ChildKernel1<<<16, 2>>>(data);
    ChildKernel2<<<64, 3>>>(data);
    ChildKernel3<<<32, 4>>>(data);
    ChildKernel4<<<16, 5>>>(data);
}
// In host code
ParentKernel<<<256, 64>>>(data);
I want to run all the child kernels in parallel. So what do I do?

Have you read the dynamic parallelism section of the programming guide?
As you've shown it, each thread in your ParentKernel will execute the code consisting of the 4 calls to child kernels. This complicates the answer.
So, across threads, yes: the various kernels launched by different threads may run in parallel with each other.
But remember that CUDA kernels issued by a given thread into the same stream will be serialized. Therefore, with respect to each individual thread in ParentKernel, the individual child kernels launched from that thread will be serialized.
To get the kernels in a single thread to have the possibility to run in parallel, launch them into separate streams.
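A minimal sketch of what the separate-streams version of ParentKernel might look like (every parent thread would still execute these launches; device-created streams must be created with the cudaStreamNonBlocking flag, and the grid sizes are just the ones from the question):
__global__ void ParentKernel(void *data){
    // Each parent thread creates its own non-blocking streams, so the child
    // launches from that thread are no longer serialized on one stream.
    cudaStream_t s1, s2, s3, s4;
    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&s3, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&s4, cudaStreamNonBlocking);

    ChildKernel1<<<16, 2, 0, s1>>>(data);
    ChildKernel2<<<64, 3, 0, s2>>>(data);
    ChildKernel3<<<32, 4, 0, s3>>>(data);
    ChildKernel4<<<16, 5, 0, s4>>>(data);

    // Device-created streams are destroyed by the kernel that created them;
    // the launched work still completes asynchronously.
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaStreamDestroy(s3);
    cudaStreamDestroy(s4);
}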
Finally, the big caveat: Just like asynchronous concurrent execution of kernels launched from the host, device side kernels may only run "in parallel" as resources permit. There is no guarantee of parallel execution of kernels.

Related

How can I pause a CUDA stream and then resume it?

Suppose we have two CUDA streams running two CUDA kernels on a GPU at the same time. How can I pause a running CUDA kernel with an instruction I put in the host code, and resume it with another instruction in the host code?
I have no idea how to write sample code for this case to illustrate the question.
My question is exactly this: is there an instruction in CUDA that can pause a kernel running in a CUDA stream and then resume it?
You can use dynamic parallelism, with parameters used for communicating the signals from the host. Then launch a parent kernel with only 1 CUDA thread and let it launch child kernels continuously until the work is done or a signal is received. If a child kernel does not fully occupy the GPU, you will lose performance.
// checkSignalExit/checkSignalPause are device helper functions that
// atomically read the host-written flags.
__global__ void parent(int * atomicSignalPause, int * atomicSignalExit, Parameters * prm)
{
    int progress = 0;
    while(checkSignalExit(atomicSignalExit) && progress < 100)
    {
        while(checkSignalPause(atomicSignalPause))
        {
            // Launch the next piece of work, then wait for it so the
            // signals are re-checked between child kernels.
            child<<<X,Y>>>(prm, progress++);
            cudaDeviceSynchronize();
        }
    }
}
There is no command to pause a stream. For multiple GPUs, you should use unified memory allocation for the communication (between GPUs).
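As a hypothetical sketch of the host side of this scheme (names match the parent kernel above; Parameters is whatever struct the child kernels consume, and this assumes a platform where the host may write managed memory while a kernel is running - otherwise pinned zero-copy memory would be used instead):
int *atomicSignalPause, *atomicSignalExit;
Parameters *prm;
cudaMallocManaged(&atomicSignalPause, sizeof(int));
cudaMallocManaged(&atomicSignalExit,  sizeof(int));
cudaMallocManaged(&prm, sizeof(Parameters));
*atomicSignalPause = 0;   // convention assumed here: 0 = run, 1 = pause
*atomicSignalExit  = 0;   // convention assumed here: 0 = keep going, 1 = exit

parent<<<1, 1>>>(atomicSignalPause, atomicSignalExit, prm);

// "Pause" by flipping the flag the kernel polls between child launches...
*atomicSignalPause = 1;
// ...and "resume" by clearing it again.
*atomicSignalPause = 0;

cudaDeviceSynchronize();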
To overcome the GPU utilization issue, you may invent a task queue for child kernels. It pushes work N times (roughly enough to keep the GPU efficient in power/compute), then for every completed child kernel it increments a dedicated counter in the parent kernel and pushes a new work item, until all work is complete (while trying to keep the number of concurrent kernels at N).
Maybe something like this:
// Producer kernel (pseudocode sketch).
// N: number of work items that keeps the GPU fully utilized.
while(hasWork)
{
    // "concurrency" is a global atomic counter of in-flight child kernels.
    while(checkConcurrencyAtomic(concurrency) < N)
    {
        incrementConcurrencyAtomic(concurrency);
        // A "consumer" parent kernel will get items from the queue;
        // it decrements "concurrency" when a work item is done.
        bool success = myQueue.tryPush(work, concurrency);
        if(success)
        {
            // Update the status of the whole work, or signal the host.
        }
    }
    // Synchronize once per ~N work items.
    cudaDeviceSynchronize();
    // ... then check for pause signals and other tasks
}
If the total work takes more than a few seconds, these atomic value updates shouldn't be a performance problem, but if you have far too many child kernels to launch then you can launch more producer/consumer (parent) CUDA threads.

CUDA: what does a stream abstract?

In the CUDA C programming guide, a stream is defined very abstractly: a sequence of CUDA operations that are executed in the order they are issued by the code.
My understanding of how instructions are executed on an Nvidia GPU is: when a kernel is launched, the blocks are distributed to the SMs in the device. Then the warps (groups of 32 threads) are scheduled by a warp scheduler in the SM for instructions to be processed warp-wise.
So, if two kernels are launched in the same stream, the first is processed before the second (since the instructions are processed in the order they are put in the stream). Does that mean the two kernels end up only using the hardware resources of one kernel? Or does each kernel have its own resources, but the second one is pending until the first is complete?
And in general, how are streams implemented in hardware? I assume it provides ordering to the warp scheduler (but then a warp scheduler is per-SM, so how would this allow multi-SM kernels to use streams?).
A CUDA stream is merely a queue of actions to be performed by the GPU.
Every function issued through the API can be issued in an asynchronous way - the CPU code continues while the instruction waits to be executed, independently from the host code. Still, it is executed synchronously with respect to the other instructions in the queue/stream.
If you want multiple operations on the GPU to be executed asynchronously, you need two or more queues/streams. For example, there is a chapter in the CUDA manual on how to mix kernel execution (first stream) with memory transfers (second stream).
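A rough illustration of that pattern (the kernel, buffer names and sizes are made up; h_b would need to be pinned host memory, e.g. from cudaMallocHost, for the copy to be truly asynchronous):
__global__ void myKernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;   // placeholder work
}

// Copy one buffer in the "copy" stream while a kernel works on another
// buffer in the "compute" stream; each stream is ordered internally,
// but the two streams may overlap.
void overlapExample(const float *h_b, float *d_a, float *d_b, int n)
{
    cudaStream_t copyStream, computeStream;
    cudaStreamCreate(&copyStream);
    cudaStreamCreate(&computeStream);

    cudaMemcpyAsync(d_b, h_b, n * sizeof(float), cudaMemcpyHostToDevice, copyStream);
    myKernel<<<(n + 255) / 256, 256, 0, computeStream>>>(d_a, n);

    cudaStreamSynchronize(copyStream);
    cudaStreamSynchronize(computeStream);
    cudaStreamDestroy(copyStream);
    cudaStreamDestroy(computeStream);
}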

What happens if a CUDA kernel is called from multiple pthreads simultaneously?

I have a CUDA kernel that does my hard work, but I also have some hard work that needs to be done on the CPU (calculations using two positions of the same array) that I could not write in CUDA (because CUDA threads are not synchronous, and I need to perform hard work on a position x of an array and afterwards do z[x] = y[x] - y[x - 1], where y is the array resulting from a CUDA kernel in which each thread works on one position of that array, and z is another array storing the result). So I'm doing this on the CPU.
I have several CPU threads to do the CPU-side work, but each one is calling a CUDA kernel, passing some data. My question is: what happens on the GPU side when multiple CPU threads are making GPU calls? Would it be better if I did the CUDA kernel call once and then created multiple CPU threads to do the CPU-side work?
Kernel calls are queued and executed one by one in a single stream.
However, you can specify a stream during kernel execution - then CUDA operations in different streams may run concurrently, and operations from different streams may be interleaved.
The default stream is 0.
See: CUDA Streams and Concurrency
Things are similar when different processes use the same card.
Also remember that kernels are executed asynchronously with respect to the CPU.
On CUDA 4.0 and later, multiple threads can share the same CUDA context, hence there is no need for cuCtxPushCurrent/cuCtxPopCurrent anymore. You just need to call cudaSetDevice for each thread. Then, as mentioned by @dzonder, you can run multiple kernels simultaneously from different threads with streams.
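A hedged sketch of that per-thread-stream approach (std::thread instead of pthreads, but the idea is the same; the kernel, names and sizes are made up for illustration):
#include <cuda_runtime.h>
#include <thread>
#include <vector>

__global__ void work(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // placeholder GPU work
}

// Each host thread selects the device, creates its own stream and launches
// into it, so launches from different threads can overlap (resources permitting).
void worker(float *d_part, int n)
{
    cudaSetDevice(0);
    cudaStream_t s;
    cudaStreamCreate(&s);
    work<<<(n + 255) / 256, 256, 0, s>>>(d_part, n);
    cudaStreamSynchronize(s);
    cudaStreamDestroy(s);
    // ...CPU-side work on this thread's results would go here...
}

int main()
{
    const int nThreads = 4, nPerThread = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, nThreads * nPerThread * sizeof(float));

    std::vector<std::thread> threads;
    for (int i = 0; i < nThreads; ++i)
        threads.emplace_back(worker, d_data + i * nPerThread, nPerThread);
    for (auto &t : threads) t.join();

    cudaFree(d_data);
    return 0;
}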

Kepler CUDA dynamic parallelism and thread divergence

There is very little information on the dynamic parallelism of Kepler. From the description of this new technology, does it mean the issue of thread control-flow divergence within a warp is solved?
It allows recursion and launching kernels from device code. Does it mean that control paths in different threads can be executed simultaneously?
Take a look at this paper.
Dynamic parallelism, flow divergence and recursion are separate concepts. Dynamic parallelism is the ability to launch new kernels (and hence new threads) from within a thread. This means, for example, you may do this:
__global__ void t_father(...) {
    ...
    t_child<<< BLOCKS, THREADS >>>();
    ...
}
I personally investigated this area. When you do something like this, i.e. when t_father launches the t_child kernels, the whole set of GPU resources is distributed again among them, and t_father waits until all the t_child launches have finished before it can go on (see also Slide 25 of this paper).
Recursion has been available since Fermi and is the ability for a thread to call itself without any other thread/block re-configuration.
Regarding flow divergence, I guess we will never see threads within a warp executing different code simultaneously.
No. The warp concept still exists. All the threads in a warp are SIMD (Single Instruction, Multiple Data), which means that at the same time they run one instruction. Even when you call a child kernel, the GPU designates one or more warps to your call. Keep 3 things in mind when you're using dynamic parallelism:
The deepest nesting level you can go to is 24 (CC = 3.5).
The number of dynamic kernels running at the same time is limited (default 4096), but can be increased.
Keep the parent kernel busy after the child kernel call, otherwise there is a good chance you will waste resources.
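For instance, both limits can be adjusted from host code before the parent kernel is launched (a sketch; the values shown are arbitrary, and the sync-depth limit only matters if the parent synchronizes on its children):
// Raise the buffer of pending child-kernel launches (value is illustrative).
cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 16384);
// Allow device-side cudaDeviceSynchronize() up to this nesting depth
// (bounded by the hardware limit of 24 on CC 3.5).
cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 4);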
There's a sample CUDA source in this NVIDIA presentation on slide 9.
__global__ void convolution(int x[])
{
    for (int j = 1; j <= x[blockIdx.x]; j++)
        kernel<<< ... >>>(blockIdx.x, j);
}
It goes on to show how part of the CUDA control code is moved to the GPU, so that the kernel can spawn other kernel functions on partial compute domains of various sizes (slide 14).
The global compute domain and its partitioning are still static, so you can't actually change this DURING GPU computation to, e.g., spawn more kernel executions because you've not yet reached the end of your evaluation function. Instead, you provide an array that holds the number of threads you want to spawn with a specific kernel.

Parallelism in GPU - CUDA / OpenCL

I have a general question about parallelism in CUDA or OpenCL code on the GPU. I use an NVIDIA GTX 470.
I read briefly through the CUDA programming guide, but did not find related answers, hence asking here.
I have a top-level function which calls the CUDA kernel (for the same kernel I have an OpenCL version of it). This top-level function itself is called 3 times in a 'for loop' from my main function, for 3 different data sets (image data R, G, B),
and the actual codelet also has processing over all the pixels in the image/frame, so it has 2 'for loops'.
What I want to know is what kind of parallelism is exploited here - task-level parallelism or data parallelism?
So what I want to understand is: does this CUDA and C code create multiple threads for different functionality/functions in the codelet and the top-level code and execute them in parallel, thereby exploiting task parallelism? If yes, who creates them, as there is no threading library explicitly included in the code or linked with it?
OR
Does it create threads/tasks for different 'for loop' iterations which are independent, and thus achieve data parallelism?
If it does this kind of parallelism, does it exploit it just by noting that different for-loop iterations have no dependencies and hence can be scheduled in parallel?
Because I don't see any special compiler constructs/intrinsics (parallel for loops as in OpenMP) which tell the compiler/scheduler to schedule such for loops / functions in parallel.
Any reading material would help.
Parallelism on GPUs is SIMT (Single Instruction, Multiple Threads). For CUDA kernels, you specify a grid of blocks where every block has N threads. The CUDA library does all the trick, and the CUDA compiler (nvcc) generates the GPU code which is executed by the GPU. The CUDA library tells the GPU driver, and in turn the thread scheduler on the GPU, how many threads should execute the kernel ((number of blocks) x (number of threads)). In your example the top-level function (or host function) executes only the kernel call, which is asynchronous and returns immediately. No threading library is needed because nvcc generates the calls to the driver.
A sample kernel call looks like this:
helloworld<<<BLOCKS, THREADS>>>(/* maybe some parameters */);
OpenCL follows the same paradigm, but you compile your kernels (if they are not precompiled) at runtime. You specify the number of threads to execute the kernel and the library does the rest.
The best way to learn CUDA (OpenCL) is to look in the CUDA Programming Guide (OpenCL Programming Guide) and look at the samples in the GPU Computing SDK.
What I want to know is what kind of parallelism is exploited here - task level parallelism or data parallelism?
Predominantly data parallelism, but there's also some task parallelism involved.
In your image processing example a kernel might do the processing for a single output pixel. You'd instruct OpenCL or CUDA to run as many threads as there are pixels in the output image. It then schedules those threads to run on the GPU/CPU that you're targeting.
Highly data parallel. The kernel is written to do a single work item, and you schedule millions of them.
The task parallelism comes in because your host program is still running on the CPU whilst the GPU is running all those threads, so it can be getting on with other work. Often this is preparing data for the next set of kernel threads, but it could be a completely separate task.
If you launch multiple kernels, they will not automatically be parallelized (i.e. no GPU task parallelism). However, the kernel invocation is asynchronous on the host side, so host code will continue running in parallel while the kernel is executing.
To get task parallelism you have to do it by hand - in CUDA the concept is called streams, and in OpenCL command queues. Without explicitly creating multiple streams/queues and scheduling each kernel to its own queue, they will be executed in sequence (there is an OpenCL feature allowing queues to run out-of-order, but I don't know if any implementation supports it). However, running the kernels in parallel will probably not give much benefit if each dataset is large enough to utilize all the GPU cores.
If you have actual for loops in your kernels, they will not in themselves be parallelized; the parallelism comes from specifying a grid size, which will cause the kernel to be invoked in parallel for each element in that grid (so if you have for loops inside your kernel, they will be executed in full by each thread). In other words, you should specify a grid size when calling the kernel, and inside the kernel use threadIdx/blockIdx (CUDA) or get_global_id() (OpenCL) to identify which data item to process in that particular thread.
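For instance, a minimal CUDA sketch of that indexing pattern (the kernel and names here are hypothetical; an OpenCL version would use get_global_id(0) instead of the blockIdx/threadIdx arithmetic):
__global__ void brighten(const unsigned char *in, unsigned char *out, int numPixels)
{
    // One thread per pixel: the grid replaces the for loop over pixels.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPixels)
    {
        int v = in[i] + 10;
        out[i] = (unsigned char)((v > 255) ? 255 : v);
    }
}

// Host side: one thread per pixel, rounded up to whole blocks.
// brighten<<<(numPixels + 255) / 256, 256>>>(d_in, d_out, numPixels);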
A useful book for learning OpenCL is the OpenCL Programming Guide, but the OpenCL spec is also worth a look.