Does the cudnnCreate() call create multiple streams internally?

I am writing a simple multi-stream CUDA application. Following is the part of the code where I create the CUDA streams, the cuBLAS handle, and the cuDNN handle:
cudaSetDevice(0);
int num_streams = 1;
cudaStream_t streams[num_streams];
cudnnHandle_t mCudnnHandle[num_streams];
cublasHandle_t mCublasHandle[num_streams];
for (int ii = 0; ii < num_streams; ii++) {
    cudaStreamCreateWithFlags(&streams[ii], cudaStreamNonBlocking);
    cublasCreate(&mCublasHandle[ii]);
    cublasSetStream(mCublasHandle[ii], streams[ii]);
    cudnnCreate(&mCudnnHandle[ii]);
    cudnnSetStream(mCudnnHandle[ii], streams[ii]);
}
Now, my stream count is 1. But when I profile the executable of the above application with the NVIDIA Visual Profiler, I see that for every stream I create, four additional streams are created. I tested it with num_streams = 8, and the profiler showed 40 streams. This raised the following questions in my mind:
Does cudnn internally create streams? If yes, then why?
If it implicitly creates streams then what is the way to utilize it?
In such case does explicitly creating streams make any sense?

Does cudnn internally create streams?
Yes.
If yes, then why?
Because it is a library, and it may need to organize CUDA concurrency. Streams are used to organize CUDA concurrency. If you want a detailed explanation of exactly what the streams are used for, you won't find one: the library internals are not documented.
If it implicitly creates streams then what is the way to utilize it?
Those streams are not intended for you to utilize separately/independently. They are for usage by the library, internal to the library routines.
In such case does explicitly creating streams make any sense?
You would still need to explicitly create any streams you needed to manage CUDA concurrency outside of the library usage.
I would like to point out that this statement is a bit misleading:
"For every stream I create it creates additional 4 more streams."
What you are doing is going through a loop, and at each loop iteration you are creating a new handle. Your observation is tied to the number of handles you create, not the number of streams you create.
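As a side note, if the number of internally created streams bothers you, one option (a sketch, not something the code in the question requires) is to create a single cuBLAS/cuDNN handle per device and simply retarget it to whichever of your streams you are about to use:

#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cudnn.h>

// Sketch: one handle per device, retargeted to whichever stream is in use.
void use_two_streams_with_one_handle() {
    cudaStream_t streams[2];
    cudaStreamCreateWithFlags(&streams[0], cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&streams[1], cudaStreamNonBlocking);

    // Handle creation (and whatever internal streams come with it) happens once.
    cudnnHandle_t cudnn;   cudnnCreate(&cudnn);
    cublasHandle_t cublas; cublasCreate(&cublas);

    // Retarget the handles before enqueueing work on a given stream.
    cudnnSetStream(cudnn, streams[0]);
    cublasSetStream(cublas, streams[0]);
    // ... enqueue cuDNN/cuBLAS calls that should run on streams[0] ...

    cudnnSetStream(cudnn, streams[1]);
    cublasSetStream(cublas, streams[1]);
    // ... enqueue cuDNN/cuBLAS calls that should run on streams[1] ...

    cublasDestroy(cublas);
    cudnnDestroy(cudnn);
    cudaStreamDestroy(streams[0]);
    cudaStreamDestroy(streams[1]);
}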

Related

Optimal use of GPU resources in case of many interdependent tasks

In my use case, the global GPU memory has many chunks of data. Preferably, the number of these could change, but assuming the number and sizes of these chunks of data to be constant is fine as well. Now, there are a set of functions that take as input some of the chunks of data and modify some of them. Some of these functions should only start processing if others completed already. In other words, these functions could be drawn in graph form with the functions being the nodes and edges being dependencies between them. The ordering of these tasks is quite weak though.
My question is now the following: What is (on a conceptual level) a good way to implement this in CUDA?
An idea that I had, which could serve as a starting point, is the following: A single kernel is launched. That single kernel creates a grid of blocks with the blocks corresponding to the functions mentioned above. Inter-block synchronization ensures that blocks only start processing data once their predecessors completed execution.
I looked up how this could be implemented, but I failed to figure out how inter-block synchronization can be done (if this is possible at all).
For any solution, I would create an array in memory of 500 node blocks * 10,000 floats (= 20 MB), with each group of 10,000 floats stored as one contiguous block. (The number of floats should preferably be divisible by 32, e.g. 10,016 floats, for memory-alignment reasons.)
Solution 1: Runtime Compilation (sequential, but optimized)
Use Python code to generate a sequential order of the functions according to the graph, and to create (by printing the source code into a string) a small program which calls the functions in turn. Each function should read its input from its predecessors' blocks in memory and store its output in its own output block. Python should output the glue code (as a string) which calls all functions in the correct order.
Use NVRTC (https://docs.nvidia.com/cuda/nvrtc/index.html, https://github.com/NVIDIA/pynvrtc) for runtime compilation and the compiler will optimize a lot.
A further optimization would be to store the intermediate results not in memory, but in local variables. Registers will be enough for all your specified cases (maximum of 255 registers per thread), although this of course makes the program a small bit more complicated. The variables can be freely named, and you can have 500 of them; the compiler will optimize the assignment to registers and reuse them. So have one variable for each node output, e.g. float node352 = f_352(node45, node182, node416);
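As a rough illustration of what the generated glue code could look like (f_1, f_2, f_3 are hypothetical node functions standing in for your real ones):

// Hypothetical generated glue kernel: each node result lives in a local
// variable, and the compiler maps these variables to registers where it can.
__device__ float f_1(float a, float b) { return a + b; }
__device__ float f_2(float a)          { return 2.0f * a; }
__device__ float f_3(float a, float b) { return a * b; }

__global__ void run_graph(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float node1 = f_1(in[i], in[n + i]);   // reads two input chunks
    float node2 = f_2(node1);              // depends on node1
    float node3 = f_3(node1, node2);       // depends on node1 and node2

    out[i] = node3;                        // only the final result is stored
}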
Solution 2: Controlled run on device (sequential)
The Python program creates a list with the order in which the functions have to be called. The individual functions know which memory blocks to read from and which block to write to (either hard-coded, or passed to them in a memory structure).
On the device, a kernel runs a for loop which goes through the order list sequentially and calls the listed function.
How to specify which functions to call?
The function pointers in the list can be created on the CPU as in the following code: https://leimao.github.io/blog/Pass-Function-Pointers-to-Kernels-CUDA/ (not sure if it works from Python).
Or, regardless of the host programming language, a separate kernel (an assign_kernel) can build a translation table of device function pointers; the list from Python would then contain indices into this table.
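A minimal sketch of that idea, assuming the ordered list is executed by a single thread block (so __syncthreads() is a sufficient barrier between nodes), with hypothetical node functions and a simple block layout:

// Table of device function pointers, filled in by an "assign" kernel.
typedef void (*NodeFn)(const float* in, float* out, int n);

__device__ void node_a(const float* in, float* out, int n) {
    int i = threadIdx.x;
    if (i < n) out[i] = in[i] + 1.0f;
}
__device__ void node_b(const float* in, float* out, int n) {
    int i = threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

__device__ NodeFn g_node_table[2];

// Device function addresses must be taken in device code.
__global__ void assign_kernel() {
    g_node_table[0] = node_a;
    g_node_table[1] = node_b;
}

// The host (e.g. Python) supplies 'order' as indices into the table.
// Launched as a single block; node k reads block k and writes block k+1.
__global__ void run_in_order(const int* order, int num_nodes,
                             float* blocks, int block_size) {
    for (int k = 0; k < num_nodes; ++k) {
        NodeFn f = g_node_table[order[k]];
        f(blocks + k * block_size, blocks + (k + 1) * block_size, block_size);
        __syncthreads();  // later nodes may read this node's output
    }
}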
Solution 3: Dynamic Parallelism (parallel)
With Dynamic Parallelism kernels themselves start other kernels (grids).
https://developer.nvidia.com/blog/cuda-dynamic-parallelism-api-principles/
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-dynamic-parallelism
There is a maximum depth of 24.
The state of the parent grid may be swapped to memory (which could take a maximum of 860 MB per level; probably not an issue for your program), but this could be a limitation.
All this swapping could make the parallel version slower again.
But the advantage would be that nodes can really be run in parallel.
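A minimal sketch of a dynamic-parallelism launch (compile with -rdc=true and link against cudadevrt); note that a parent grid is not considered finished until all of its child grids have completed:

__global__ void child(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

__global__ void parent(float* data, int n) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        // Launch a dependent node as a child grid, from device code.
        child<<<(n + 255) / 256, 256>>>(data, n);
        // The parent grid only completes once the child grid has completed.
    }
}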
Solution 4: Use Cuda Streams and Events (parallel)
Each kernel just calls one function. The synchronization and scheduling are done from Python, but the kernels run asynchronously and invoke a callback as soon as they are finished. Each kernel that should run in parallel has to be run on a separate stream.
Optimization: You can use the CUDA graph API, with which CUDA learns the order of the kernels and can do additional optimizations when replaying it (possibly with different float input data, but the same graph).
For all methods
You can try different launch configurations, from 32 (or better, 64) threads per block up to 1024 threads per block.
Let's assume that most, or all, of your chunks of data are large, and that you have many distinct functions. If the former does not hold, it's not clear you will even benefit from having them on a GPU in the first place. Let's also assume that the functions are black boxes to you, and that you don't have the ability to identify fine-grained dependencies between individual values in your different buffers with simple, local dependency functions.
Given these assumptions - your workload is basically the typical case of GPU work, which CUDA (and OpenCL) have catered for since their inception.
Traditional plain-vanilla approach
You define multiple streams (queues) of tasks; you schedule kernels on these streams for your various functions; and you schedule event-fires and event-waits corresponding to your functions' inter-dependencies (or the buffer-processing dependencies). The event-waits before kernel launches ensure no buffer is processed until all preconditions have been satisfied. Then you have different CPU threads wait on / synchronize with these streams, to get your work going.
Now, as far as the CUDA APIs go - this is bread-and-butter stuff. If you've read the CUDA Programming Guide, or at least the basic sections of it, you know how to do this. You could avail yourself of convenience libraries, like my API wrapper library, or if your workload fits, a higher-level offering such as NVIDIA Thrust might be more appropriate.
The multi-threaded synchronization is a bit less trivial, but this still isn't rocket-science. What is tricky and delicate is choosing how many streams to use and what work to schedule on what stream.
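As a minimal sketch of this pattern (funcA and funcB are placeholder kernels standing in for two dependent functions):

#include <cuda_runtime.h>

__global__ void funcA(float* a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] += 1.0f;
}
__global__ void funcB(const float* a, float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] = 2.0f * a[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc((void**)&a, n * sizeof(float));
    cudaMalloc((void**)&b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaEvent_t aDone;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaEventCreateWithFlags(&aDone, cudaEventDisableTiming);

    funcA<<<(n + 255) / 256, 256, 0, s1>>>(a, n);
    cudaEventRecord(aDone, s1);          // fires once funcA has finished
    cudaStreamWaitEvent(s2, aDone, 0);   // s2 will not run funcB before that
    funcB<<<(n + 255) / 256, 256, 0, s2>>>(a, b, n);

    cudaDeviceSynchronize();
    cudaEventDestroy(aDone);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}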
Using CUDA task graphs
With CUDA 10.x, NVIDIA added API functions for explicitly creating task graphs, with kernels and memory copies as nodes and dependencies as edges; when you've completed the graph-construction API calls, you "schedule the task graph", so to speak, on any stream, and the CUDA runtime essentially takes care of what I've described above, automagically.
For an elaboration on how to do this, please read:
Getting Started with CUDA Graphs
on the NVIDIA developer blog. Or, for a deeper treatment - there's actually a section about them in the programming guide, and a small sample app using them, simpleCudaGraphs.
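For the flavor of it, here is a minimal stream-capture sketch (with a placeholder kernel), assuming CUDA 10.1 or later:

#include <cuda_runtime.h>

__global__ void step(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float* x;
    cudaMalloc((void**)&x, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);

    // Capture the work issued to this stream into a graph.
    cudaGraph_t graph;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    step<<<(n + 255) / 256, 256, 0, s>>>(x, n);
    step<<<(n + 255) / 256, 256, 0, s>>>(x, n);
    cudaStreamEndCapture(s, &graph);

    // Instantiate once, then replay the whole graph cheaply.
    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);  // CUDA 10/11-era signature
    for (int iter = 0; iter < 10; ++iter)
        cudaGraphLaunch(exec, s);

    cudaStreamSynchronize(s);
    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(s);
    cudaFree(x);
    return 0;
}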
White-box functions
If you actually do know a lot about your functions, then perhaps you can create larger GPU kernels which perform some dependent processing, by keeping parts of intermediate results in registers or in block shared memory, and continuing to the part of a subsequent function applied to such local results. For example, if your first kernel does c[i] = a[i] + b[i] and your second kernel does e[i] = d[i] * c[i], you could instead write a kernel which performs the second action after the first, with inputs a, b, d (no need for c). Unfortunately I can't be less vague here, since your question was somewhat vague.
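For instance, a fused version of those two example kernels might look like this (names are hypothetical):

__global__ void fused(const float* a, const float* b, const float* d,
                      float* e, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float c = a[i] + b[i];   // intermediate result stays in a register
        e[i] = d[i] * c;         // no global-memory traffic for c
    }
}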

Why does printf() work within a kernel, but using std::cout doesn't?

I have been exploring the field of parallel programming and have written basic kernels in CUDA and SYCL. I have encountered a situation where I had to print inside the kernel, and I noticed that std::cout inside the kernel does not work whereas printf works. For example, consider the following SYCL code.
This works:
void print(float* A, size_t N) {
    buffer<float, 1> Buffer{A, {N}};
    queue Queue((intel_selector()));
    Queue.submit([&Buffer, N](handler& Handler) {
        auto accessor = Buffer.get_access<access::mode::read>(Handler);
        Handler.parallel_for<dummyClass>(range<1>{N}, [accessor](id<1> idx) {
            printf("%f", accessor[idx[0]]);
        });
    });
}
whereas if I replace the printf with std::cout << accessor[idx[0]], it raises a compile-time error saying: "Accessing non-const global variable is not allowed within SYCL device code."
A similar thing happens with CUDA kernels.
This got me thinking: what is the difference between printf and std::cout that causes such behavior?
Also, suppose I wanted to implement a custom print function to be called from the GPU; how should I do it?
TIA
This got me thinking: what is the difference between printf and std::cout that causes such behavior?
Yes, there is a difference. The printf() which runs in your kernel is not the standard C library printf(). A different call is made, to an on-device function (the code of which is closed, if it exists as CUDA C at all). That function uses a hardware mechanism on NVIDIA GPUs - a buffer for kernel threads to print into, which gets sent back over to the host side, and the CUDA driver then forwards it to the standard output file descriptor of the process which launched the kernel.
std::cout does not get this sort of a compiler-assisted replacement/hijacking - and its code is simply irrelevant on the GPU.
A while ago, I implemented an std::cout-like mechanism for use in GPU kernels; see this answer of mine here on SO for more information and links. But - I decided I don't really like it, and its compilation is rather expensive, so instead, I adapted a printf()-family implementation for the GPU, which is now part of the cuda-kat library (development branch).
That means I've had to answer your second question for myself:
If I wanted to implement a custom print function to be called from the GPU, how should I do it?
Unless you have access to undisclosed NVIDIA internals - the only way to do this is to build on top of in-kernel printf() calls, rather than on the C standard library or the system calls you would use on the host side. You essentially need to layer your entire stream mechanism over that low-level, primitive I/O facility. It is far from trivial.
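For example, a minimal device-side helper (a sketch, not the library code mentioned above) can do nothing more than wrap the in-kernel printf():

#include <cstdio>

// There is no __device__ std::cout to build on, so the helper wraps printf().
__device__ void print_value(const char* label, float v) {
    printf("[block %d, thread %d] %s = %f\n",
           (int)blockIdx.x, (int)threadIdx.x, label, v);
}

__global__ void demo(const float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) print_value("data", data[i]);
}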
In SYCL you cannot use std::cout for output in code not running on the host, for reasons similar to those listed in the answer above for CUDA code.
This means if you are running kernel code on the "device" (e.g. a GPU) then you need to use the stream class. There is more information about this in the SYCL developer guide section called Logging.
There is no __device__ version of std::cout, so only printf can be used in device code.

Are loops allowed in Linux's BPF programs?

I am thinking of a solution for replicating packets in the kernel and forwarding them to 5 hosts (unicast). I am planning to utilize eBPF/XDP for it.
I am trying to loop 5 times, and inside the loop I plan to clone the packet, modify the destination IP address, update the checksum, and transmit the packet out on the same interface it was received on.
I read somewhere that loops can't be used in XDP, so I am not sure if this will work.
I need the experts' advice, please.
Edit August 2022
The bpf_loop() BPF helper function is available starting with Linux 5.17 (commit). It takes a number of iterations, a callback function, and a context on which to run this callback in a loop within an eBPF program:
long bpf_loop(u32 nr_loops, void *callback_fn, void *callback_ctx, u64 flags)
Edit June 2019
Bounded loops have now landed in the kernel, and are available starting with Linux 5.3 (commit).
Original answer
No, at this moment loops are not allowed in eBPF programs. Back edges are not allowed, so that the kernel verifier can make sure that the programs terminate (and do not hang the kernel).
This might change in the future, as kernel developers are working on support for bounded loops.
There are two possible workarounds worth mentioning. Both suppose that you know how many times you would have to “loop” when writing your program.
First, there's an exception to the back-edge rule for function calls. This means you can have functions and call them multiple times. So you could put all the content that you would normally put in your loop into a separate function, and call this function as many times as you would loop.
The second thing is that you can actually write loops in your C code, and ask clang to unroll them at compile time. This looks like the following:
#pragma clang loop unroll(full)
for (i = 0; i < 4; i++) {
    /* Do stuff ... */
}
This means that in the resulting object file the loop will be unrolled: it will be replaced by the full series of instructions to be performed, with no actual backward jump.
There is no solution for a loop with an arbitrary number of iterations at the moment.
For Linux <5.3:
Technically, back-edges in the control flow graph of BPF bytecode programs are forbidden, not loops. Concretely, this means that you can write bounded loops in C, but you have to unroll them at compile time.
To unroll a loop, you can use Clang's #pragma unroll directive. This should work for a 5-iteration loop, but it won't for very long loops.

A few questions about implicit synchronization?

In the CUDA programming guide, it is mentioned that the following operations will cause implicit synchronization:
a page-locked host memory allocation
I'm wondering whether this includes cudaHostRegister and cudaHostUnregister? If not, this would imply that we can call malloc before all asynchronous operations, and then in the asynchronous part we can call cudaHostRegister. Is this right?
any CUDA command to the default stream
Does this include any operations with cudaEvent, such as recording events on stream 0 or making stream 0 wait on events in other streams?
By the way, does the implicit synchronization happen within one device or will the synchronization be over all the devices?
I'm wondering whether this includes cudaHostRegister and cudaHostUnregister?
Yes, I believe there is implicit synchronization with these functions. But like I said in my comment above, these are slow functions. Use cudaHostAlloc() instead if you are able. If you're using shared memory or something like that which requires cudaHostRegister(), you'd generally want to take care of this just once near the beginning of your program and then just leave it registered.
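A rough sketch of both options (buffer size is arbitrary), pinning the malloc'd buffer once near program start and leaving it registered:

#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const size_t bytes = 1 << 20;

    // Preferred: allocate pinned (page-locked) host memory directly.
    float* pinned = nullptr;
    cudaHostAlloc((void**)&pinned, bytes, cudaHostAllocDefault);

    // Alternative: pin an existing allocation once, near program start,
    // and leave it registered for the lifetime of the program.
    float* existing = (float*)malloc(bytes);
    cudaHostRegister(existing, bytes, cudaHostRegisterDefault);

    // ... issue asynchronous copies/kernels that use these buffers ...

    cudaHostUnregister(existing);
    free(existing);
    cudaFreeHost(pinned);
    return 0;
}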
Does this include any operations with cudaEvent, such as recording events on stream 0 or making stream 0 wait on events in other streams?
Again, this is a CUDA call in the default stream, so I believe implicit synchronization is done here as well.
By the way, does the implicit synchronization happen within one device or will the synchronization be over all the devices?
Synchronization only applies to the same device. It doesn't impact other devices.
Note that you can now create a stream which doesn't implicitly synchronize with the default stream using cudaStreamCreateWithFlags:
cudaStreamCreateWithFlags( &stream, cudaStreamNonBlocking );
There is something else that could be useful if your host code runs CUDA kernels on the same GPU from multiple host threads. CUDA 7.0 RC has a new nvcc option, --default-stream=per-thread, which you might want to look into. With this option, each host thread by default uses its own stream.
But if you're trying to optimize and check for implicit synchronizations, I'd start by using the CUDA profiler, nvvp, or the profiler that is part of Nsight Eclipse Edition.

Difference between kernels construct and parallel construct

I have studied a lot of articles and the OpenACC manual, but I still don't understand the main difference between these two constructs.
The kernels directive is the more general case, and probably the one you might think of if you've written GPU (e.g. CUDA) kernels before. kernels simply directs the compiler to work on a piece of code and produce an arbitrary number of "kernels", of arbitrary "dimensions", to be executed in sequence, in order to parallelize/offload a particular section of code to the accelerator. The parallel construct allows finer-grained control of how the compiler will attempt to structure work on the accelerator, for example by specifying specific dimensions of parallelization. For example, the number of workers and gangs would normally be constant as part of the parallel directive (since only one underlying "kernel" is usually implied), but perhaps not with the kernels directive (since it may translate to multiple underlying "kernels").
A good treatment of this specific question is contained in this PGI article.
Quoting from the article summary:
"The OpenACC kernels and parallel constructs each try to solve the same problem, identifying loop parallelism and mapping it to the machine parallelism. The kernels construct is more implicit, giving the compiler more freedom to find and map parallelism according to the requirements of the target accelerator. The parallel construct is more explicit, and requires more analysis by the programmer to determine when it is legal and appropriate. "
OpenACC directives and GPU kernels are just two ways of representing the same thing -- a section of code that can run in parallel.
OpenACC may be best when retrofitting an existing app to take advantage of a GPU and/or when it is desirable to let the compiler handle more details related to issues such as memory management. This can make it faster to write an app, with a potential cost in performance.
Kernels may be best when writing a GPU app from scratch and/or when more fine grained control is desired. This can make the app take longer to write, but may increase performance.
I think that people new to GPUs may be tempted to go with OpenACC because it looks more familiar. But I think it's actually better to go the other way, and start with writing kernels, and then, potentially move to OpenACC to save time in some projects. The reason is that OpenACC is a leaky abstraction. So, while OpenACC may make it look as if the GPU details are abstracted out, they are still there. So, using OpenACC to write GPU code without understanding what is happening in the background is likely to be frustrating, with odd error messages when attempting to compile, and result in an app that has low performance.
Parallel Construct
Defines the region of the program that should be compiled for parallel execution on the accelerator device.
The parallel loop directive is an assertion by the programmer that it is both safe and desirable to parallelize the affected loop. This relies on the programmer to have correctly identified parallelism in the code and remove anything in the code that may be unsafe to parallelize. If the programmer asserts incorrectly that the loop may be parallelized then the resulting application may produce incorrect results.
The parallel construct allows finer-grained control of how the compiler will attempt to structure work on the accelerator. So it does not rely heavily on the compiler’s ability to automatically parallelize the code.
When parallel loop is used on two subsequent loops that access the same data a compiler may or may not copy the data back and forth between the host and the device between the two loops.
More experienced parallel programmers, who may have already identified parallel loops within their code, will likely find the parallel loop approach more desirable.
For example:
#pragma acc parallel
{
    #pragma acc loop
    for (i = 0; i < n; i++)
        a[i] = 3.0f * (float)(i + 1);

    #pragma acc loop
    for (i = 0; i < n; i++)
        b[i] = 2.0f * a[i];
}
This generates one kernel. There is no barrier between the two loops: the second loop may start before the first loop ends. (This is different from OpenMP.)
Kernels Construct
Defines the region of the program that should be compiled into a sequence of kernels for execution on the accelerator device.
An important thing to note about the kernels construct is that the compiler will analyze the code and only parallelize when it is certain that it is safe to do so. In some cases, the compiler may not have enough information at compile time to determine whether a loop is safe to parallelize, in which case it will not parallelize the loop, even if the programmer can clearly see that the loop is safely parallel.
The kernels construct gives the compiler maximum leeway to parallelize and optimize the code how it sees fit for the target accelerator but also relies most heavily on the compiler’s ability to automatically parallelize the code.
One more notable benefit that the kernels construct provides is that if multiple loops access the same data it will only be copied to the accelerator once which may result in less data motion.
Programmers with less parallel programming experience or whose code contains a large number of loops that need to be analyzed may find the kernels approach much simpler, as it puts more of the burden on the compiler.
For example:
#pragma acc kernels
{
    for (i = 0; i < n; i++)
        a[i] = 3.0f * (float)(i + 1);

    for (i = 0; i < n; i++)
        b[i] = 2.0f * a[i];
}
This generates two kernels. There is an implicit barrier between the two loops: the second loop will start after the first loop ends.