CUDA graph stream capture with thrust::reduce - cuda

When I try to capture stream execution to build a CUDA graph, a call to thrust::reduce causes a runtime error: cudaErrorStreamCaptureUnsupported: operation not permitted when stream is capturing. I have tried returning the reduction result to both host and device variables, and I am issuing the reduction on the proper stream by means of thrust::cuda::par.on(stream). Is there any way I can add the execution of Thrust functions to CUDA graphs?

Thrust's reduction is a blocking operation on the host side: thrust::reduce returns its result to the host, so it synchronizes the stream, and that synchronization is exactly what is not permitted while the stream is being captured. I am assuming that you want to use the result of the reduction as a parameter to one of your subsequent kernels. During capture that value cannot be produced, because it lives on the host and is not available until the reduction kernel finishes execution. As a workaround, you can keep the reduction result in device memory and try adding a host node to your graph that consumes the result of the reduction.
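A minimal sketch of that host-node idea, assuming the reduction itself is swapped for a capture-safe cub::DeviceReduce::Sum call (which writes its result to device memory instead of returning it to the host) and the host node is added via cudaLaunchHostFunc during capture. The buffer names and the consumeResult callback are placeholders, not from the question:

#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstdio>

// Host node: runs on the CPU once the preceding copy node has finished.
// A host function inside a graph must not call any CUDA API functions.
void CUDART_CB consumeResult(void* userData) {
    float* h_result = static_cast<float*>(userData);
    std::printf("reduction result = %f\n", *h_result);
}

int main() {
    const int n = 1 << 20;
    float *d_in, *d_out, *h_result;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMallocHost(&h_result, sizeof(float));                           // pinned, host-visible

    // Query and allocate CUB's temporary storage before capture begins.
    void* d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);
    cudaMalloc(&d_temp, temp_bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n, stream); // kernel node(s)
    cudaMemcpyAsync(h_result, d_out, sizeof(float),
                    cudaMemcpyDeviceToHost, stream);                    // memcpy node
    cudaLaunchHostFunc(stream, consumeResult, h_result);                // host node
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
    cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);
    return 0;
}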

Related

In CUDA programming how to understand if the GPU kernel has completed its task?

In CUDA programming suppose I am calling a kernel function from the host.
Suppose the kernel function is,
__global__ void my_kernel_func() {
    // doing some tasks utilizing multiple threads
}
Now from the host I am calling it using,
my_kernel_func<<<grid,block>>>();
In the NVIDIA examples, they call three more functions afterwards,
cudaGetLastError()
CUDA Doc : Returns the last error that has been produced by any of the runtime calls in the same host thread and resets it to cudaSuccess.
cudaMemcpy()
CUDA Doc : Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst, where kind specifies the direction of the copy, and must be one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice, or cudaMemcpyDefault. Passing cudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However, cudaMemcpyDefault is only allowed on systems that support unified virtual addressing. Calling cudaMemcpy() with dst and src pointers that do not match the direction of the copy results in an undefined behavior.
and then
cudaDeviceSynchronize()
CUDA Doc : Blocks until the device has completed all preceding requested tasks. cudaDeviceSynchronize() returns an error if one of the preceding tasks has failed. If the cudaDeviceScheduleBlockingSync flag was set for this device, the host thread will block until the device has finished its work.
Now I have tried to put a print statement at the end of the kernel function,
__global__ void my_kernel_func() {
    // doing some tasks utilizing multiple threads
    printf("D\n");
}
I also printed at different points of the sequential flow on the host,
cudaGetLastError()
print A
cudaMemcpy()
print B
cudaDeviceSynchronize()
print C
This prints in the following order:
A
D
B
C
Basically, I need the time at which the kernel completes its task, but I am confused about where to take the ending timestamp, because copying the data back takes a considerable amount of time; if I put the ending timestamp after the copy, it will include the copy time as well.
Is there any other function available to catch the kernel's completion?
As pointed out in the documentation, cudaMemcpy() exhibits synchronous behavior, so the cudaDeviceSynchronize() is effectively a no-op: the synchronization has already happened inside the memcpy.
cudaGetLastError() only checks whether the kernel launch itself was valid.
If you want to time the kernel and not the memcpys, switch the order of the cudaMemcpy()/cudaDeviceSynchronize() calls: start the timer just before the kernel call, then read the timer after the cudaDeviceSynchronize() call. Make sure to check the return value of cudaDeviceSynchronize() as well.
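A minimal sketch of that ordering, timed with a host-side clock; the kernel, sizes, and buffer names are placeholders, not from the question:

#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void my_kernel_func(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                        // stand-in for "some tasks"
}

int main() {
    const int n = 1 << 20;
    float* h_data = new float[n]();
    float* d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    auto start = std::chrono::steady_clock::now();
    my_kernel_func<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaError_t launchErr = cudaGetLastError();        // was the launch itself valid?
    cudaError_t syncErr   = cudaDeviceSynchronize();   // wait for the kernel only
    auto stop = std::chrono::steady_clock::now();      // kernel time, copy not included

    if (launchErr != cudaSuccess || syncErr != cudaSuccess)
        std::printf("CUDA error: %s\n",
                    cudaGetErrorString(launchErr != cudaSuccess ? launchErr : syncErr));
    std::printf("kernel time: %.3f ms\n",
                std::chrono::duration<double, std::milli>(stop - start).count());

    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost); // timed separately if needed
    cudaFree(d_data);
    delete[] h_data;
    return 0;
}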

How do the nodes in a CUDA graph connect?

CUDA graphs are a new way to synthesize complex operations from multiple operations. With "stream capture", it appears that you can run a mix of operations, including cuBLAS and similar library operations, and capture them as a single "meta-kernel".
What's unclear to me is how the data flow works for these graphs. In the capture phase, I allocate memory A for the input, memory B for the temporary values, and memory C for the output. But when I capture this in a graph, I don't capture the memory allocations. So when I then instantiate multiple copies of these graphs, they cannot share the input memory A, temporary workspace B or output memory C.
How then does this work? I.e. when I call cudaGraphLaunch, I don't see a way to provide input parameters. My captured graph basically starts with a cudaMemcpyHostToDevice, how does the graph know which host memory to copy and where to put it?
Background: I found that CUDA is heavily bottlenecked on kernel launches; my AVX2 code was 13x slower when ported to CUDA. The kernels themselves seem fine (according to Nsight); it's just the overhead of scheduling several hundred thousand kernel launches.
A memory allocation would typically be done outside of a graph definition/instantiation or "capture".
However, graphs provide for "memory copy" nodes, where you would typically do cudaMemcpy type operations.
At the time of graph definition, you pass a set of arguments for each graph node (which will depend on the node type, e.g. arguments for the cudaMemcpy operation, if it is a memory copy node, or kernel arguments if it is a kernel node). These arguments determine the actual memory allocations that will be used when that graph is executed.
If you wanted to use a different set of allocations, one method would be to instantiate another graph with different arguments for the nodes where there are changes. This could be done by repeating the entire process, or by starting with an existing graph, making changes to node arguments, and then instantiating a graph with those changes.
Currently, in CUDA graphs, it is not possible to perform runtime binding (i.e. at the point of graph "launch") of node arguments to a particular graph/node. It's possible that new features may be introduced in future releases, of course.
Note that there is a CUDA sample code called simpleCudaGraphs, available in CUDA 10, which demonstrates the use of both memory copy nodes and kernel nodes, and also how to create dependencies (effectively execution dependencies) between nodes.
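A minimal sketch of that explicit construction, in the spirit of the simpleCudaGraphs sample: one host-to-device memcpy node feeding one kernel node, with the node arguments fixed at definition time. The buffer names and the scale kernel are placeholders:

#include <cuda_runtime.h>

__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    float* h_in;
    float* d_buf;
    cudaMallocHost(&h_in, n * sizeof(float));        // allocations live outside the graph
    cudaMalloc(&d_buf, n * sizeof(float));

    cudaGraph_t graph;
    cudaGraphCreate(&graph, 0);

    // Memcpy node: its arguments (src, dst, size, direction) are fixed at definition time.
    cudaGraphNode_t copyNode;
    cudaMemcpy3DParms copyParams = {};
    copyParams.srcPtr = make_cudaPitchedPtr(h_in, n * sizeof(float), n, 1);
    copyParams.dstPtr = make_cudaPitchedPtr(d_buf, n * sizeof(float), n, 1);
    copyParams.extent = make_cudaExtent(n * sizeof(float), 1, 1);
    copyParams.kind   = cudaMemcpyHostToDevice;
    cudaGraphAddMemcpyNode(&copyNode, graph, nullptr, 0, &copyParams);

    // Kernel node: launch configuration and kernel arguments are also fixed here.
    cudaGraphNode_t kernelNode;
    int nArg = n;
    void* kernelArgs[] = { &d_buf, &nArg };
    cudaKernelNodeParams kp = {};
    kp.func         = (void*)scale;
    kp.gridDim      = dim3((n + 255) / 256);
    kp.blockDim     = dim3(256);
    kp.kernelParams = kernelArgs;
    // The dependency on copyNode is what orders the two nodes.
    cudaGraphAddKernelNode(&kernelNode, graph, &copyNode, 1, &kp);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);
    return 0;
}

To run the same graph on a different set of buffers, you would change the node arguments (for example with cudaGraphExecKernelNodeSetParams on the instantiated graph, or by instantiating a second graph) rather than passing parameters at launch time.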

CUDA kernel launched after call to thrust is synchronous or asynchronous?

I am having some trouble with the results of my computations; for some reason they are not correct. I checked the code and it seems right (although I will check it again).
My question is whether custom CUDA kernels launched after a call to Thrust are synchronous or asynchronous, e.g.
thrust::sort_by_key(args);
arrangeData<<<blocks,threads>>>(args);
will the kernel arrangeData run only after thrust::sort_by_key has finished?
Assuming your code looks like that, and there is no usage of streams going on (neither the kernel call nor the thrust call indicates any stream usage as you have posted it), then both activities are issued to the default stream. I also assume (although it would not change my answer in this case) that the args passed to the thrust call are device arguments, not host arguments (e.g. device_vector, not host_vector).
All CUDA API and kernel calls issued to the default stream (or any given single stream) will be executed in order.
The arrangeData kernel will not begin until any kernels launched by the thrust::sort_by_key call are complete.
You can verify this using a profiler, e.g. nvvp
Note that synchronous vs. asynchronous may be a bit confusing. When we talk about kernel launches being asynchronous, we are almost always referring to the host CPU activity, i.e. the kernel launch is asynchronous with respect to the host thread, which means it returns control to the host thread immediately, and its execution will occur at some unspecified time with respect to the host thread.
CUDA API calls and kernel calls issued to the same stream are always synchronous with respect to each other. A given kernel will not begin execution until all prior cuda activity issued to that stream (even things like cudaMemcpyAsync) has completed.
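A minimal sketch of that pattern on the default stream; the data and the arrangeData body are placeholders, not from the question:

#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/raw_pointer_cast.h>

__global__ void arrangeData(const int* keys, float* vals, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        vals[i] += (float)keys[i];   // stand-in for the real rearrangement
}

int main() {
    const int n = 1 << 16;
    thrust::device_vector<int> keys(n);
    thrust::device_vector<float> vals(n);

    // Issued to the default stream; thrust launches its own kernels here.
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());

    // Same stream, so this kernel cannot start until the sort kernels are done.
    arrangeData<<<(n + 255) / 256, 256>>>(
        thrust::raw_pointer_cast(keys.data()),
        thrust::raw_pointer_cast(vals.data()), n);

    // Only needed to make the host wait; it is not required for the ordering above.
    cudaDeviceSynchronize();
    return 0;
}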

CUDA overlap of data transfer and kernel execution, implicit synchronization for streams

After reading the "overlap of data transfer and kernel execution" section in the "CUDA C Programming Guide", I have a question: what exactly does data transfer refer to? Does it include cudaMemsetAsync, cudaMemcpyAsync, cudaMemset, and cudaMemcpy? Of course, the memory used for the memcpy is pinned.
In the implicit synchronization (streams) section, the guide says "a device memory set" may serialize the streams. So does that refer to cudaMemsetAsync, cudaMemcpyAsync, cudaMemset, and cudaMemcpy? I am not sure.
Any function call with an Async at the end has a stream parameter. Additionally, some of the libraries provided by the CUDA toolkit also have the option of setting a stream. By using this, you can have multiple streams running concurrently.
This means that unless you specifically create and set a stream, you will be using the default stream. For example, there is no separate default stream for data transfers and another for kernel execution; you will have to create two (or more) streams yourself and assign each one a task of your choice.
A common use case is to have the two streams as mentioned in the programming guide. Keep in mind, this is only useful if you have multiple kernel launches. You can get the data needed for the next (independent) kernel or the next iteration of the current kernel while computing the results for the current kernel. This can maximize both compute and bandwidth capabilities.
For the function calls you mention, cudaMemcpy and cudaMemcpyAsync are the only functions performing data transfers. I don't think cudaMemset and cudaMemsetAsync can be termed as data transfers.
Both cudaMemcpyAsync and cudaMemsetAsync can be used with streams, while cudaMemset and cudaMemcpy are blocking calls that do not make use of streams.
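A minimal two-stream sketch of the overlap described above: while one chunk is being processed, the next chunk's copy can be in flight on the other stream. The async copies require pinned host memory; the chunk sizes and the process kernel are placeholders:

#include <cuda_runtime.h>

__global__ void process(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int chunks = 4, chunkN = 1 << 20;
    float *h_buf, *d_buf;
    cudaMallocHost(&h_buf, chunks * chunkN * sizeof(float)); // pinned host memory
    cudaMalloc(&d_buf, chunks * chunkN * sizeof(float));

    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    for (int c = 0; c < chunks; ++c) {
        cudaStream_t s = streams[c % 2];
        float* h = h_buf + c * chunkN;
        float* d = d_buf + c * chunkN;
        // Copy, compute, and copy back for this chunk in one stream;
        // the other stream's work can overlap with it.
        cudaMemcpyAsync(d, h, chunkN * sizeof(float), cudaMemcpyHostToDevice, s);
        process<<<(chunkN + 255) / 256, 256, 0, s>>>(d, chunkN);
        cudaMemcpyAsync(h, d, chunkN * sizeof(float), cudaMemcpyDeviceToHost, s);
    }
    cudaDeviceSynchronize();
    return 0;
}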

Cuda: Kernel launch queue

I'm not finding much info on the mechanics of a kernel launch operation. The API documentation says to see the CUDA Programming Guide, and I'm not finding much there either.
Since kernel execution is asynchronous, and some machines support concurrent execution, I'm led to believe there is a queue for the kernels.
Host code:
1. malloc(hostArry, ......);
2. cudaMalloc(deviceArry, .....);
3. cudaMemcpy(deviceArry, hostArry, ... hostToDevice);
4. kernelA<<<1,300>>>(int, int);
5. kernelB<<<10,2>>>(float, int));
6. cudaMemcpy(hostArry, deviceArry, ... deviceToHost);
7. cudaFree(deviceArry);
Line 3 is synchronous. Lines 4 & 5 are asynchronous, and the machine supports concurrent execution, so at some point both of these kernels are running on the GPU. (There is the possibility that kernelB starts and finishes before kernelA finishes.) While this is happening, the host is executing line 6. Line 6 is synchronous with respect to the copy operation, but there is nothing preventing it from executing before kernelA or kernelB has finished.
1) Is there a kernel queue in the GPU? (Does the GPU block/stall the host?)
2) How does the host know that the kernel has finished, and it is "safe" to Xfer the results from the device to the host?
Yes, there are a variety of queues on the GPU, and the driver manages those.
Asynchronous calls return more or less immediately. Synchronous calls do not return until the operation is complete. Kernel calls are asynchronous. Most other CUDA runtime API calls are designated by the suffix Async if they are asynchronous. So to answer your question:
1) Is there a kernel queue in the GPU? (Does the GPU block/stall the host?)
There are various queues. The GPU blocks/stalls the host on a synchronous call, but a kernel launch is not a synchronous operation; it returns immediately, before the kernel has completed, and perhaps before the kernel has even started. When launching operations into a single stream, all CUDA operations in that stream are serialized. Therefore, even though kernel launches are asynchronous, you will not observe overlapped execution for two kernels launched to the same stream, because the CUDA subsystem guarantees that a given CUDA operation in a stream will not start until all previous CUDA operations in the same stream have finished. There are other specific rules for the null stream (the stream you are using if you don't explicitly call out streams in your code), but the preceding description is sufficient for understanding this question.
2) How does the host know that the kernel has finished, and it is "safe" to Xfer the results from the device to the host?
Since the operation that transfers results from the device to the host is a CUDA call (cudaMemcpy...), and it is issued in the same stream as the preceding operations, the device and CUDA driver manage the execution sequence of cuda calls so that the cudaMemcpy does not begin until all previous CUDA calls issued to the same stream have completed. Therefore a cudaMemcpy issued after a kernel call in the same stream is guaranteed not to start until the kernel call is complete, even if you use cudaMemcpyAsync.
You can use cudaDeviceSynchronize() after a kernel call to guarantee that all previous tasks requested of the device have been completed.
If the results of kernelB are independent of the results of kernelA, you can place this call right before the memory copy operation. If not, you will need to block the device before calling kernelB, resulting in two blocking operations.
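A minimal sketch of the listing above with that synchronization in place; the kernel signatures and sizes are placeholders. Note that the device-to-host copy on line 6 is ordered after both kernels by the stream anyway, so the explicit cudaDeviceSynchronize() only makes the host itself wait:

#include <cuda_runtime.h>
#include <cstdlib>

__global__ void kernelA(int* arr, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) arr[i] = i;
}

__global__ void kernelB(float* arr, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) arr[i] = 2.0f * i;
}

int main() {
    const int n = 300;
    int*   hostArry = (int*)std::malloc(n * sizeof(int));          // 1
    int*   deviceArry;
    float* deviceArryF;
    cudaMalloc(&deviceArry, n * sizeof(int));                      // 2
    cudaMalloc(&deviceArryF, n * sizeof(float));
    cudaMemcpy(deviceArry, hostArry, n * sizeof(int),
               cudaMemcpyHostToDevice);                            // 3: blocks the host
    kernelA<<<1, 300>>>(deviceArry, n);                            // 4: returns immediately
    kernelB<<<10, 2>>>(deviceArryF, n);                            // 5: returns immediately
    cudaDeviceSynchronize();            // host now waits for both kernels to finish
    cudaMemcpy(hostArry, deviceArry, n * sizeof(int),
               cudaMemcpyDeviceToHost);                            // 6: ordered after 4 and 5 regardless
    cudaFree(deviceArry);                                          // 7
    cudaFree(deviceArryF);
    std::free(hostArry);
    return 0;
}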