Using a loop in a CUDA graph

I have kernels A, B, and C which need to be executed sequentially:
A->B->C
They are executed in a while loop until some condition is met:
while(predicate) {
A->B->C
}
The while loop may be executed anywhere from 3 to 2000 times; the information that the loop should stop is produced by kernel C.
As the execution involves many invocations of relatively small kernels, a CUDA graph sounds like a good idea. However, the CUDA graph examples I have seen are all linear or tree-like, without loops.
If a loop is not possible, a long chain of 2000 kernels with the possibility of an early stop triggered from kernel C would also be acceptable. But is it possible to stop graph execution at some position by a call from inside a kernel?

CUDA graphs have no conditionals. A vertex of the graph is visited/executed when its predecessors are complete, and that's that. So, fundamentally, you cannot do this with a CUDA graph.
What can you do?
Have a smaller graph for the loop iteration, and repeatedly schedule it.
Have A, B and C start their execution by checking whether the loop should still run - and skip all work if it shouldn't. That way, you can schedule many instances of A->B->C->A->B->C etc., which, starting at some point, will do nothing (a sketch of this follows below).
Don't rely on the CUDA graphs API. It's not a general-purpose parallel execution mechanism. :-(
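For the second option, here is a minimal sketch of what the early-exit kernels could look like, assuming a device-side flag (here called done, set by kernel C once the loop should stop); the kernel bodies and the stop condition are placeholders:
__device__ int done = 0;               // set to 1 by kernel C when the loop should stop

__global__ void A(float *data, int n) {
    if (done) return;                  // this iteration becomes a no-op once we are finished
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;        // placeholder work; kernel B is analogous
}

__global__ void C(float *data, int n) {
    if (done) return;
    // ... placeholder work ...
    if (blockIdx.x == 0 && threadIdx.x == 0 && data[0] > 100.0f)   // placeholder stop test
        done = 1;                      // all later scheduled instances of A, B, C do nothing
}

// Host side: schedule the worst-case number of iterations up front, e.g.
// for (int it = 0; it < 2000; ++it) { A<<<g,b>>>(d,n); B<<<g,b>>>(d,n); C<<<g,b>>>(d,n); }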

Related

Optimal use of GPU resources in case of many interdependent tasks

In my use case, the global GPU memory has many chunks of data. Preferably, the number of these could change, but assuming the number and sizes of these chunks of data to be constant is fine as well. Now, there are a set of functions that take as input some of the chunks of data and modify some of them. Some of these functions should only start processing if others completed already. In other words, these functions could be drawn in graph form with the functions being the nodes and edges being dependencies between them. The ordering of these tasks is quite weak though.
My question is now the following: What is (on a conceptual level) a good way to implement this in CUDA?
An idea that I had, which could serve as a starting point, is the following: A single kernel is launched. That single kernel creates a grid of blocks with the blocks corresponding to the functions mentioned above. Inter-block synchronization ensures that blocks only start processing data once their predecessors completed execution.
I looked up how this could be implemented, but I failed to figure out how inter-block synchronization can be done (if this is possible at all).
For any solution I would create an array in memory of 500 node blocks * 10,000 floats (= 20 MB), with each group of 10,000 floats stored as one contiguous block. (The number of floats should preferably be divisible by 32, e.g. 10,016 floats, for memory alignment reasons.)
Solution 1: Runtime Compilation (sequential, but optimized)
Use Python code to generate a sequential ordering of the functions according to the graph, and have it emit (as a source-code string) a small program which calls the functions in turn. Each function should read its input from its predecessors' memory blocks and store its output in its own output block. Python outputs the glue code (as a string) which calls all functions in the correct order.
Use NVRTC (https://docs.nvidia.com/cuda/nvrtc/index.html, https://github.com/NVIDIA/pynvrtc) for runtime compilation and the compiler will optimize a lot.
A further optimization would be to keep the intermediate results not in memory but in local variables. These will be enough for all your specified cases (there is a maximum of 255 registers per thread), though it makes the program slightly more complicated. The variables can be freely named, and you can have 500 of them; the compiler will take care of assigning them to registers and reusing registers. So have one variable for each node output, e.g. float node352 = f_352(node45, node182, node416);
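For illustration, the Python-generated source fed to NVRTC could look roughly like this, with one local variable per node output (the function names and node numbering are made up; the node functions are assumed to be defined or included elsewhere):
extern "C" __global__ void run_graph(const float *input, float *output)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per float of a node block
    float node45  = f_45(input[i]);
    float node182 = f_182(node45);
    float node352 = f_352(node45, node182);          // intermediate values stay in registers
    output[i] = node352;                             // only the final result is written to memory
}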
Solution 2: Controlled run on device (sequential)
The Python program creates a list with the order in which the functions have to be called. The individual functions know which memory blocks to read from and which block to write to (either hard-coded, or passed to them in a memory structure).
On the device, a kernel runs a for loop that walks through the order list sequentially and calls the function from the list.
How do you specify which functions to call?
The function pointers in the list can be created on the CPU as in the following code: https://leimao.github.io/blog/Pass-Function-Pointers-to-Kernels-CUDA/ (not sure whether it works from Python).
Or, regardless of host programming language, a separate kernel can create a translation table of device function pointers (assign_kernel); the list from Python would then contain indices into this table.
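A minimal sketch of that translation-table variant (assign_kernel follows the description above; the node functions, their signature, and the sequential runner are illustrative):
typedef float (*node_fn)(const float *in);       // assumed signature of a node function

__device__ float f0(const float *in) { return in[0] + 1.0f; }   // placeholder node functions
__device__ float f1(const float *in) { return in[0] * 2.0f; }

__device__ node_fn table[2];                     // device-side function pointer table

__global__ void assign_kernel() {                // fill the table on the device, where the
    table[0] = f0;                               // addresses of __device__ functions are valid
    table[1] = f1;
}

__global__ void run_in_order(const int *order, int count, const float *in, float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int s = 0; s < count; ++s) {            // walk the order list produced by Python
        node_fn fn = table[order[s]];
        out[i] = fn(in);                         // real code would select per-node in/out blocks
        __syncthreads();                         // the block finishes a node before the next one
    }
}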
Solution 3: Dynamic Parallelism (parallel)
With Dynamic Parallelism kernels themselves start other kernels (grids).
https://developer.nvidia.com/blog/cuda-dynamic-parallelism-api-principles/
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-dynamic-parallelism
There is a maximum depth of 24.
The state of the parent grid may have to be swapped to memory (which could take up to 860 MB per level, though probably not for your program), and this can be a limitation.
All this swapping could make the parallel version slower again.
But the advantage is that nodes can really run in parallel.
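A minimal dynamic-parallelism sketch (compile with -rdc=true; the kernels are placeholders, and device-side synchronization rules differ between CUDA versions, so this only shows the launch pattern):
__global__ void child(float *block) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    block[i] *= 2.0f;                            // placeholder node work on one 10,016-float block
}

__global__ void parent(float *blocks, int n_nodes) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        for (int c = 0; c < n_nodes; ++c)
            child<<<313, 32>>>(blocks + c * 10016);   // one child grid per node (313*32 = 10,016)
        // The parent grid is not considered complete until all of its child grids have
        // completed, so a kernel launched after 'parent' in the same stream sees the results.
    }
}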
Solution 4: Use CUDA Streams and Events (parallel)
Each kernel just calls one function. The synchronization and scheduling are done from Python, but the kernels run asynchronously and invoke a callback as soon as they are finished. Kernels that should run in parallel have to be launched on separate streams.
Optimization: You can use the CUDA graph API, with which CUDA records the order of the kernels and can do additional optimizations when replaying (possibly with other float input data, but the same graph).
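A minimal sketch of that graph optimization using stream capture (error checking omitted; kernelA/kernelB and their arguments are placeholders):
cudaStream_t s;
cudaGraph_t graph;
cudaGraphExec_t exec;
cudaStreamCreate(&s);

cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);   // record the launches, don't execute yet
kernelA<<<grid, block, 0, s>>>(d_a);
kernelB<<<grid, block, 0, s>>>(d_a, d_b);
cudaStreamEndCapture(s, &graph);

cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);  // note: signature changed in CUDA 12

for (int i = 0; i < n_repeats; ++i)                       // replay the whole sequence cheaply
    cudaGraphLaunch(exec, s);
cudaStreamSynchronize(s);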
For all methods
You can try different launch configurations, from 32 (or better, 64) threads per block up to 1024 threads per block.
Let's assume that most, or all, of your chunks of data are large, and that you have many distinct functions. If the former does not hold, it's not clear you will even benefit from having them on a GPU in the first place. Let's also assume that the functions are black boxes to you, and you don't have the ability to identify fine-grained dependencies between individual values in your different buffers with simple, local dependency functions.
Given these assumptions - your workload is basically the typical case of GPU work, which CUDA (and OpenCL) have catered for since their inception.
Traditional plain-vanilla approach
You define multiple streams (queues) of tasks; you schedule kernels on these streams for your various functions; and you schedule event-fires and event-waits corresponding to your functions' inter-dependencies (or the buffer-processing dependencies). The event-waits before kernel launches ensure no buffer is processed until all of its preconditions have been satisfied. Then you have different CPU threads wait on / synchronize with these streams, to get your work going.
Now, as far as the CUDA APIs go - this is bread-and-butter stuff. If you've read the CUDA Programming Guide, or at least the basic sections of it, you know how to do this. You could avail yourself of convenience libraries, like my API wrapper library, or if your workload fits, a higher-level offering such as NVIDIA Thrust might be more appropriate.
The multi-threaded synchronization is a bit less trivial, but this still isn't rocket-science. What is tricky and delicate is choosing how many streams to use and what work to schedule on what stream.
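A minimal sketch of the stream-and-event wiring for a dependency "h needs the results of f and g" (the kernels and buffers are placeholders):
cudaStream_t s1, s2;
cudaEvent_t g_done;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);
cudaEventCreateWithFlags(&g_done, cudaEventDisableTiming);

f<<<grid, block, 0, s1>>>(buf_a);        // independent producers may overlap
g<<<grid, block, 0, s2>>>(buf_b);
cudaEventRecord(g_done, s2);             // "g has finished" marker on s2

cudaStreamWaitEvent(s1, g_done, 0);      // s1 now also waits for g
h<<<grid, block, 0, s1>>>(buf_a, buf_b, buf_c);   // runs after both f (same stream) and g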
Using CUDA task graphs
With CUDA 10.x, NVIDIA added API functions for explicitly creating task graphs, with kernels and memory copies as nodes, and edges for dependencies; and when you've completed the graph-construction API calls, you "schedule the task graph", so to speak, on any stream, and the CUDA runtime essentially takes care of what I've described above, automagically.
For an elaboration on how to do this, please read:
Getting Started with CUDA Graphs
on the NVIDIA developer blog. Or, for a deeper treatment, there's actually a section about them in the programming guide, and a small sample app using them, simpleCudaGraphs.
White-box functions
If you actually do know a lot about your functions, then perhaps you can create larger GPU kernels which perform some dependent processing, by keeping parts of intermediate results in registers or in block shared memory, and continuing into the part of a subsequent function applied to such local results. For example, if your first kernel does c[i] = a[i] + b[i] and your second kernel does e[i] = c[i] * d[i], you could instead write a kernel which performs the second action after the first, with inputs a, b, d (and no need for c in memory at all). Unfortunately I can't be less vague here, since your question was somewhat vague.
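In code, that fusion looks roughly like this:
// Unfused: two kernels, with c written to and read back from global memory.
__global__ void add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}
__global__ void mul(const float *c, const float *d, float *e, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) e[i] = c[i] * d[i];
}

// Fused: the intermediate value stays in a register; c never exists in memory.
__global__ void add_mul(const float *a, const float *b, const float *d, float *e, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float c = a[i] + b[i];
        e[i] = c * d[i];
    }
}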

How do the nodes in a CUDA graph connect?

CUDA graphs are a new way to synthesize complex operations from multiple operations. With "stream capture", it appears that you can run a mix of operations, including cuBLAS and similar library operations, and capture them as a single "meta-kernel".
What's unclear to me is how the data flow works for these graphs. In the capture phase, I allocate memory A for the input, memory B for the temporary values, and memory C for the output. But when I capture this in a graph, I don't capture the memory allocations. So when I then instantiate multiple copies of these graphs, they cannot share the input memory A, temporary workspace B or output memory C.
How then does this work? I.e. when I call cudaGraphLaunch, I don't see a way to provide input parameters. My captured graph basically starts with a cudaMemcpyHostToDevice, how does the graph know which host memory to copy and where to put it?
Background: I found that CUDA is heavily bottlenecked on kernel launches; my AVX2 code was 13x slower when ported to CUDA. The kernels themselves seem fine (according to NSight); it's just the overhead of scheduling several hundred thousand kernel launches.
A memory allocation would typically be done outside of a graph definition/instantiation or "capture".
However, graphs provide for "memory copy" nodes, where you would typically do cudaMemcpy type operations.
At the time of graph definition, you pass a set of arguments for each graph node (which will depend on the node type, e.g. arguments for the cudaMemcpy operation, if it is a memory copy node, or kernel arguments if it is a kernel node). These arguments determine the actual memory allocations that will be used when that graph is executed.
If you wanted to use a different set of allocations, one method would be to instantiate another graph with different arguments for the nodes where there are changes. This could be done by repeating the entire process, or by starting with an existing graph, making changes to node arguments, and then instantiating a graph with those changes.
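A rough sketch of that second method, building a one-node graph explicitly, then changing the kernel node's arguments and instantiating again (myKernel, n, and the d_in/d_out pointers are placeholders; the cudaGraphInstantiate signature shown is the pre-CUDA-12 one):
cudaGraph_t graph;
cudaGraphNode_t kNode;
cudaGraphCreate(&graph, 0);

void *args1[] = { &d_in1, &d_out1, &n };
cudaKernelNodeParams p = {};
p.func = (void *)myKernel;
p.gridDim = dim3(256);
p.blockDim = dim3(256);
p.kernelParams = args1;
cudaGraphAddKernelNode(&kNode, graph, nullptr, 0, &p);    // node arguments fixed at definition time

cudaGraphExec_t exec1;
cudaGraphInstantiate(&exec1, graph, nullptr, nullptr, 0); // executable graph using d_in1 / d_out1

// Same topology, different allocations: update the node, instantiate a second executable graph.
void *args2[] = { &d_in2, &d_out2, &n };
p.kernelParams = args2;
cudaGraphKernelNodeSetParams(kNode, &p);
cudaGraphExec_t exec2;
cudaGraphInstantiate(&exec2, graph, nullptr, nullptr, 0); // executable graph using d_in2 / d_out2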
Currently, in CUDA graphs, it is not possible to perform runtime binding (i.e. at the point of graph "launch") of node arguments to a particular graph/node. It's possible that new features may be introduced in future releases, of course.
Note that there is a CUDA sample code called simpleCudaGraphs available in CUDA 10 which demonstrates the use of both memory copy nodes, and kernel nodes, and also how to create dependencies (effectively execution dependencies) between nodes.

Difference between kernels construct and parallel construct

I have studied a lot of articles and the OpenACC manual, but I still don't understand the main difference between these two constructs.
The kernels directive is the more general case, and probably the one you might think of if you've written GPU (e.g. CUDA) kernels before. kernels simply directs the compiler to work on a piece of code, and produce an arbitrary number of "kernels", of arbitrary "dimensions", to be executed in sequence, to parallelize/offload a particular section of code to the accelerator. The parallel construct allows finer-grained control of how the compiler will attempt to structure work on the accelerator, for example by specifying specific dimensions of parallelization. For example, the number of workers and gangs would normally be constant as part of the parallel directive (since only one underlying "kernel" is usually implied), but perhaps not on the kernels directive (since it may translate to multiple underlying "kernels").
A good treatment of this specific question is contained in this PGI article.
Quoting from the article summary:
"The OpenACC kernels and parallel constructs each try to solve the same problem, identifying loop parallelism and mapping it to the machine parallelism. The kernels construct is more implicit, giving the compiler more freedom to find and map parallelism according to the requirements of the target accelerator. The parallel construct is more explicit, and requires more analysis by the programmer to determine when it is legal and appropriate. "
OpenACC directives and GPU kernels are just two ways of representing the same thing -- a section of code that can run in parallel.
OpenACC may be best when retrofitting an existing app to take advantage of a GPU and/or when it is desirable to let the compiler handle more details related to issues such as memory management. This can make it faster to write an app, with a potential cost in performance.
Kernels may be best when writing a GPU app from scratch and/or when more fine grained control is desired. This can make the app take longer to write, but may increase performance.
I think that people new to GPUs may be tempted to go with OpenACC because it looks more familiar. But I think it's actually better to go the other way, and start with writing kernels, and then, potentially move to OpenACC to save time in some projects. The reason is that OpenACC is a leaky abstraction. So, while OpenACC may make it look as if the GPU details are abstracted out, they are still there. So, using OpenACC to write GPU code without understanding what is happening in the background is likely to be frustrating, with odd error messages when attempting to compile, and result in an app that has low performance.
Parallel Construct
Defines the region of the program that should be compiled for parallel execution on the accelerator device.
The parallel loop directive is an assertion by the programmer that it is both safe and desirable to parallelize the affected loop. This relies on the programmer to have correctly identified parallelism in the code and remove anything in the code that may be unsafe to parallelize. If the programmer asserts incorrectly that the loop may be parallelized then the resulting application may produce incorrect results.
The parallel construct allows finer-grained control of how the compiler will attempt to structure work on the accelerator. So it does not rely heavily on the compiler’s ability to automatically parallelize the code.
When parallel loop is used on two subsequent loops that access the same data, a compiler may or may not copy the data back and forth between the host and the device between the two loops (a data-region sketch addressing this follows the example below).
More experienced parallel programmers, who may have already identified parallel loops within their code, will likely find the parallel loop approach more desirable.
e.g.:
#pragma acc parallel
{
#pragma acc loop
for (i=0; i<n; i++)
a[i] = 3.0f*(float)(i+1);
#pragma acc loop
for (i=0; i<n; i++)
b[i] = 2.0f*a[i];
}
- Generates one kernel.
- There is no barrier between the two loops: the second loop may start before the first loop ends. (This is different from OpenMP.)
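If you don't want to leave the data-movement decision mentioned above to the compiler, the same pair of loops can be wrapped in a structured data region. A sketch (the create/copyout clauses depend on what the host actually needs back):
#pragma acc data create(a[0:n]) copyout(b[0:n])
{
#pragma acc parallel loop
for (i=0; i<n; i++)
a[i] = 3.0f*(float)(i+1);
#pragma acc parallel loop
for (i=0; i<n; i++)
b[i] = 2.0f*a[i];
}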
Kernels Construct
Defines the region of the program that should be compiled into a sequence of kernels for execution on the accelerator device.
An important thing to note about the kernels construct is that the compiler will analyze the code and only parallelize when it is certain that it is safe to do so. In some cases, the compiler may not have enough information at compile time to determine whether a loop is safe to parallelize, in which case it will not parallelize the loop, even if the programmer can clearly see that the loop is safely parallel.
The kernels construct gives the compiler maximum leeway to parallelize and optimize the code how it sees fit for the target accelerator but also relies most heavily on the compiler’s ability to automatically parallelize the code.
One more notable benefit of the kernels construct is that if multiple loops access the same data, it will only be copied to the accelerator once, which may result in less data motion.
Programmers with less parallel programming experience or whose code contains a large number of loops that need to be analyzed may find the kernels approach much simpler, as it puts more of the burden on the compiler.
e.g.:
#pragma acc kernels
{
for (i=0; i<n; i++)
a[i] = 3.0f*(float)(i+1);
for (i=0; i<n; i++)
b[i] = 2.0f*a[i];
}
- Generates two kernels.
- There is an implicit barrier between the two loops: the second loop will start after the first loop ends.

Communication between CUDA threads/thread blocks

I am trying to "map" a few tasks to CUDA GPU. There are n tasks to process. (See the pseudo-code)
malloc a boolean array flag[n] and initialize it to false.
for each work-group in parallel do
    while there are still unfinished tasks do
        Do something;
        for a few j_1, j_2, .. j_m (j_i < k) do
            Wait until task j_i is finished;   [ while(!flag[j_i]) ; ]
            Do something;
        end for
        Do something;
        Mark task k finished;   [ flag[k] = true; ]
    end while
end for
For some reason, I have to use threads in different thread blocks.
The question is how to implement the Wait until task j_i is finished; and Mark task k finished; steps in CUDA. My implementation is to use a boolean array as the flags: set a flag once a task is done, and spin on a flag to check whether a task is done.
But it only works on small cases; on one large case, the GPU crashes for an unknown reason. Is there any better way to implement the Wait and Mark in CUDA?
That's basically a problem of inter-thread communication on CUDA.
Synchronising within a threadblock is straightforward using __syncthreads(). However synchronising between threadblocks is more tricky - the programming model method is to break into two kernels.
If you think about it, it makes sense. The execution model (for both CUDA and OpenCL) is for a whole bunch of blocks executing on processing units, but says nothing about when. This means that some blocks will be executing but others will not (they'll be waiting). So if you have a __syncblocks() then you would risk deadlock, since those already executing will stop, but those not executing will never reach the barrier.
You can share information between blocks (using global memory and atomics, for example), but you cannot get global synchronisation.
Depending on what you're trying to do, there is frequently another way of solving or breaking down the problem.
What you're asking for is not easily done since thread blocks can be scheduled in any order, and there is no easy way to synchronize or communicate between them. From the CUDA Programming Guide:
For the parallel workloads, at points in the algorithm where parallelism is broken because some threads need to synchronize in order to share data with each other, there are two cases: Either these threads belong to the same block, in which case they should use __syncthreads() and share data through shared memory within the same kernel invocation, or they belong to different blocks, in which case they must share data through global memory using two separate kernel invocations, one for writing to and one for reading from global memory. The second case is much less optimal since it adds the overhead of extra kernel invocations and global memory traffic. Its occurrence should therefore be minimized by mapping the algorithm to the CUDA programming model in such a way that the computations that require inter-thread communication are performed within a single thread block as much as possible.
So if you can't fit all the communication you need within a thread block, you would need to have multiple kernel calls in order to accomplish what you want.
I don't believe there is any difference with OpenCL, but I also don't work in OpenCL.
This kind of problem is best solved by a slightly different approach:
Don't assign fixed tasks to your threads, forcing your threads to wait until their task becomes available (which isn't possible in CUDA since threads can't block).
Instead, keep a list of available tasks (using atomic operations) and have each thread grab a task from that list.
This is still tricky to implement and get the corner cases right, but at least it's possible.
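A minimal sketch of the grab-a-task idea, with a global counter claimed via atomicAdd (one task per block; handling tasks that only become ready when their dependencies finish needs extra bookkeeping on top of this):
__device__ unsigned int next_task = 0;     // reset to 0 before each launch

__global__ void worker(float *data, unsigned int n_tasks) {
    __shared__ unsigned int my_task;
    while (true) {
        if (threadIdx.x == 0)
            my_task = atomicAdd(&next_task, 1u);   // this block claims the next task
        __syncthreads();
        if (my_task >= n_tasks) break;             // list exhausted: whole block exits
        // ... the whole block processes task 'my_task' here ...
        __syncthreads();                           // finish the task before claiming another
    }
}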
I think you don't need to implement this in CUDA; everything can be implemented on the CPU. You are waiting for a task to complete, then doing another task. If you want to implement it in CUDA, you don't need to wait for all the flags to be true. You know that initially all the flags are false, so just implement Do something in parallel for all the threads and change the flags to true.
If you do want to implement it in CUDA, take an int flag and keep adding 1 to it after finishing Do something, so that you can observe the change in the flag before and after doing Do something.
If I got your question wrong, please comment and I'll try to improve the answer.

How to calculate total time for CPU + GPU

I am doing some computation on the CPU, and then I transfer the numbers to the GPU and do some work there. I want to calculate the total time taken to do the computation on the CPU + the GPU. How do I do so?
When your program starts, in main(), use any system timer to record the time. When your program ends at the bottom of main(), use the same system timer to record the time. Take the difference between time2 and time1. There you go!
There are different system timers you can use, some with higher resolution than others. Rather than discuss those here, I'd suggest you search for "system timer" on the SO site. If you just want any system timer, gettimeofday() works on Linux systems, but it has been superseded by newer, higher-precision functions. As it is, gettimeofday() only measures time in microseconds, which should be sufficient for your needs.
If you can't get a timer with good enough resolution, consider running your program in a loop many times, timing the execution of the loop, and dividing the measured time by the number of loop iterations.
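A minimal sketch of that wall-clock approach with gettimeofday() (the GPU work in the middle is a placeholder; the final cudaDeviceSynchronize() makes sure pending GPU work is included in the measurement):
#include <stdio.h>
#include <sys/time.h>
#include <cuda_runtime.h>

static double wall_seconds(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void) {
    double t0 = wall_seconds();
    // ... CPU computation, memory transfers and kernel launches go here ...
    cudaDeviceSynchronize();                 // wait for the GPU before stopping the clock
    double t1 = wall_seconds();
    printf("total CPU + GPU time: %f s\n", t1 - t0);
    return 0;
}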
EDIT:
System timers can be used to measure total application performance, including time used during the GPU calculation. Note that using system timers in this way applies only to real, or wall-clock, time, rather than process time. Measurements based on the wall-clock time must include time spent waiting for GPU operations to complete.
If you want to measure the time taken by a GPU kernel, you have a few options. First, you can use the Compute Visual Profiler to collect a variety of profiling information, and although I'm not sure that it reports time, it must be able to (that's a basic profiling function). Other profilers - PAPI comes to mind - offer support for CUDA kernels.
Another option is to use CUDA events to record times. Please refer to the CUDA 4.0 Programming Guide where it discusses using CUDA events to measure time.
Yet another option is to use system timers wrapped around GPU kernel invocations. Note that, given the asynchronous nature of kernel invocation returns, you will also need to follow the kernel invocation with a host-side GPU synchronization call such as cudaThreadSynchronize() for this method to be applicable. If you go with this option, I highly recommend calling the kernel in a loop, timing the loop plus one synchronization at the end (since kernel calls launched into the same stream execute in order, cudaThreadSynchronize() is not needed inside the loop), and dividing by the number of iterations.
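A minimal sketch of the CUDA-event option combined with the loop-and-divide advice (error checking omitted; myKernel and its launch configuration are placeholders):
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

const int iters = 1000;
cudaEventRecord(start, 0);
for (int i = 0; i < iters; ++i)
    myKernel<<<grid, block>>>(d_data);      // all launches go into the default stream
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                 // wait until the last kernel has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);     // elapsed GPU time in milliseconds
printf("average kernel time: %f ms\n", ms / iters);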
The C timer keeps running regardless of whether the GPU is working or not. If you don't believe me, try this little experiment: make a for loop with 1000 iterations over GPU_Function_Call, and put any C timer around that for loop. When you run the program (supposing the GPU function takes substantial time, like 20 ms) you will see it running for a few seconds with the naked eye before it returns, but when you print the C time you'll notice it shows you only a few milliseconds. This is because the C timer didn't wait for the 1000 MemcpyHtoD, 1000 MemcpyDtoH and 1000 kernel calls.
What I suggest is to use the CUDA event timer, or even better the NVIDIA Visual Profiler, to time the GPU, and to use a stopwatch (increase the iterations to reduce human error) to measure the complete time. Then just subtract the GPU time from the total to get the CPU time.