In CUDA programming how to understand if the GPU kernel has completed its task? - cuda

In CUDA programming suppose I am calling a kernel function from the host.
Suppose the kernel function is,
__global__ void my_kernel_func() {
    // doing some tasks utilizing multiple threads
}
Now from the host I am calling it using,
my_kernel_func<<<grid,block>>>();
In the NVIDIA examples, three more functions are called afterwards:
cudaGetLastError()
CUDA Doc : Returns the last error that has been produced by any of the runtime calls in the same host thread and resets it to cudaSuccess.
cudaMemcpy()
CUDA Doc : Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst, where kind specifies the direction of the copy, and must be one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice, or cudaMemcpyDefault. Passing cudaMemcpyDefault is recommended, in which case the type of transfer is inferred from the pointer values. However, cudaMemcpyDefault is only allowed on systems that support unified virtual addressing. Calling cudaMemcpy() with dst and src pointers that do not match the direction of the copy results in an undefined behavior.
and then
cudaDeviceSynchronize()
CUDA Doc : Blocks until the device has completed all preceding requested tasks. cudaDeviceSynchronize() returns an error if one of the preceding tasks has failed. If the cudaDeviceScheduleBlockingSync flag was set for this device, the host thread will block until the device has finished its work.
Now I have tried to put a print statement at the end of the kernel function,
__global__ void my_kernel_func() {
    // doing some tasks utilizing multiple threads
    printf("D\n");
}
I also printed at different points in the sequential flow on the host:
cudaGetLastError();
printf("A\n");
cudaMemcpy();
printf("B\n");
cudaDeviceSynchronize();
printf("C\n");
This prints in the following order:
A
D
B
C
Basically, I need the time at which the kernel completes its task, and I am confused about where to take the ending timestamp: there is considerable time spent copying back the data, so if I put the ending timestamp after the copy, it will incorporate the copying time as well.
Is there any other function available to catch the end of kernel execution?

As pointed out in the documentation, cudaMemcpy() exhibits synchronous behavior, so the cudaDeviceSynchronize() is effectively a no-op here: the synchronization has already happened at the memcpy.
The cudaGetLastError() call checks whether the kernel launch itself was valid.
If you want to time the kernel and not the memcpy, swap the order of the cudaMemcpy()/cudaDeviceSynchronize() calls: start the timer just before the kernel call, then read the timer after the cudaDeviceSynchronize() call. Make sure to test the result of the cudaDeviceSynchronize() call as well.
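For example, a minimal sketch of that ordering using CUDA events for the timing (grid, block, and the destination/source/count arguments of the copy are placeholders carried over from the question):
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);                    // timestamp enqueued just before the kernel
my_kernel_func<<<grid, block>>>();
cudaEventRecord(stop);                     // timestamp enqueued right after the kernel

cudaError_t err = cudaDeviceSynchronize(); // wait for the kernel to finish
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);    // kernel time only, no copy included
printf("kernel took %f ms\n", ms);

cudaMemcpy(dst, src, count, cudaMemcpyDeviceToHost); // the copy now happens outside the timed region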

Related

Is sort_by_key in thrust a blocking call?

I repeatedly enqueue a sequence of kernels:
for 1..100:
    for 1..10000:
        // Enqueue GPU kernels
        Kernel 1 - update each element of array
        Kernel 2 - sort array
        Kernel 3 - operate on array
    end
    // run some CPU code
    output "Waiting for GPU to finish"
    // copy from device to host
    cudaMemcpy ... D2H(array)
end
Kernel 3 is of order O(N^2), so it is by far the slowest of all. For Kernel 2 I use thrust::sort_by_key directly on the device:
thrust::device_ptr<unsigned int> key(dKey);
thrust::device_ptr<unsigned int> value(dValue);
thrust::sort_by_key(key,key+N,value);
It seems that this call to thrust is blocking, as the CPU code only gets executed once the inner loop has finished. I see this because if I remove the call to sort_by_key, the host code (correctly) outputs the "Waiting" string before the inner loop finishes, while it does not if I run the sort.
Is there a way to call thrust::sort_by_key asynchronously?
First of all, consider that there is a kernel launch queue that can hold only so many pending launches. Once the launch queue is full, additional kernel launches of any kind are blocking: the host thread will not proceed past those launch requests until empty queue slots become available. I'm pretty sure 10000 iterations of 3 kernel launches will fill this queue well before the 10000th iteration, so there will be some latency (I think) with any sort of non-trivial kernel launches when you launch 30000 of them in sequence. (Eventually, however, as queued kernels complete and all remaining launches fit in the queue, you would see the "Waiting" message before all kernels have actually completed, if there were no other blocking behavior.)
thrust::sort_by_key requires temporary storage (of a size approximately equal to your data set size). This temporary storage is allocated, each time you use it, via a cudaMalloc operation under the hood. This cudaMalloc operation is blocking: when cudaMalloc is issued from a host thread, it waits for a gap in kernel activity before it can proceed.
To work around item 2, there seem to be at least two possible approaches:
Provide a thrust custom allocator. Depending on the characteristics of this allocator, you might be able to eliminate the blocking cudaMalloc behavior. (but see discussion below)
Use cub SortPairs. The advantage here (as I see it - your example is incomplete) is that you can do the allocation once (assuming you know the worst-case temp storage size throughout the loop iterations) and eliminate the need to do a temporary memory allocation within your loop.
The thrust method (1, above), as far as I know, will still effectively do some kind of temporary allocation/free step at each iteration, even if you supply a custom allocator. With a well-designed custom allocator, however, this might be almost a no-op. The cub method appears to have the drawback of needing to know the maximum size (in order to completely eliminate the allocation/free step), but I contend the same requirement applies to a thrust custom allocator: if you ever need to allocate more memory, the custom allocator will effectively have to do something like a cudaMalloc, which will throw a wrench in the works.
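A rough sketch of the cub route (dKey, dValue, and N are carried over from the question; dKeyOut and dValueOut are assumed separate output buffers, since the basic cub SortPairs sorts out-of-place, and cub's convention of reporting the required size when the temp pointer is null is what enables the one-time allocation):
#include <cub/cub.cuh>

// Query the worst-case temporary storage size once, up front.
void  *d_temp = nullptr;
size_t temp_bytes = 0;
cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                dKey, dKeyOut, dValue, dValueOut, N);
cudaMalloc(&d_temp, temp_bytes);   // single allocation, reused every iteration

for (int i = 0; i < 10000; ++i) {
    // ... kernel 1 ...
    cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                    dKey, dKeyOut, dValue, dValueOut, N);
    // ... kernel 3 ...
}
cudaFree(d_temp);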

CUDA kernel launched after call to thrust is synchronous or asynchronous?

I am having some trouble with the results of my computations; for some reason they are not correct. I checked the code and it seems right (although I will check it again).
My question is whether custom CUDA kernels are synchronous or asynchronous when launched after a call to thrust, e.g.
thrust::sort_by_key(args);
arrangeData<<<blocks,threads>>>(args);
will the kernel arrangeData run after thrust::sort has finished?
Assuming your code looks like that, and there is no usage of streams going on (neither the kernel call nor the thrust call indicates any stream usage as you have posted it), then both activities are issued to the default stream. I also assume (although it would not change my answer in this case) that the args passed to the thrust call are device arguments, not host arguments (e.g. device_vector, not host_vector).
All CUDA API and kernel calls issued to the default stream (or any given single stream) will be executed in order.
The arrangeData kernel will not begin until any kernels launched by the thrust::sort_by_key call are complete.
You can verify this using a profiler, e.g. nvvp.
Note that synchronous vs. asynchronous may be a bit confusing. When we talk about kernel launches being asynchronous, we are almost always referring to the host CPU activity, i.e. the kernel launch is asynchronous with respect to the host thread, which means it returns control to the host thread immediately, and its execution will occur at some unspecified time with respect to the host thread.
CUDA API calls and kernel calls issued to the same stream are always synchronous with respect to each other. A given kernel will not begin execution until all prior cuda activity issued to that stream (even things like cudaMemcpyAsync) has completed.
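As an illustration (a sketch, not from the original answer), the same in-order guarantee holds if you put both operations on an explicit stream; newer Thrust versions accept an execution policy for this:
#include <thrust/sort.h>
#include <thrust/execution_policy.h>

cudaStream_t s;
cudaStreamCreate(&s);

// Both operations are issued to stream s, so they execute in order.
thrust::sort_by_key(thrust::cuda::par.on(s), key, key + N, value);
arrangeData<<<blocks, threads, 0, s>>>(args);

cudaStreamSynchronize(s);  // host waits for both to complete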

Would CPU continue executing the next line code when the GPU kernel is running? Would it cause an error?

I am reading the book "CUDA by Example" by Sanders, where the author mentions (p. 441): "For example, when we launched the kernel in our ray tracer, the GPU begins executing our code, but the CPU continues executing the next line of our program before the GPU finishes." -- Highlighted Mar 4, 2014
I am wondering if this statement is correct. For example, what if the next instruction CPU continues executing depends on the variables that the GPU kernel outputs? Would it cause an error? From my experience, it does not cause an error. So what does the author really mean?
Many thanks!
Yes, the author is correct. Suppose my kernel launch looks like this:
int *h_in_data, *d_in_data, *h_out_data, *d_out_data;
// code to allocate host and device pointers, and initialize host data
...
// copy host data to device
cudaMemcpy(d_in_data, h_in_data, size_of_data, cudaMemcpyHostToDevice);
mykernel<<<grid, block>>>(d_in_data, d_out_data);
// some other host code happens here
// at this point, h_out_data does not point to valid data
...
cudaMemcpy(h_out_data, d_out_data, size_of_data, cudaMemcpyDeviceToHost);
//h_out_data now points to valid data
Immediately after the kernel launch, the CPU continues executing host code. But the data generated by the device (either d_out_data or h_out_data) is not ready yet. If the host code attempts to use whatever is pointed to by h_out_data, it will just be garbage data. This data only becomes valid after the 2nd cudaMemcpy operation.
Note that using the data (h_out_data) before the 2nd cudaMemcpy will not generate an error, if by that you mean a segmentation fault or some other run time error. But any results generated will not be correct.
Kernel launches in CUDA are asynchronous by default, i.e., control returns to the CPU right after the launch. If the next instruction on the CPU is another kernel launch, you don't need to worry: that launch will be executed only after the previously launched kernel has finished.
However, if the next instruction is a CPU instruction that accesses the results of the kernel, there can be a problem of accessing garbage values. Therefore care has to be taken, and device synchronization should be done as and when needed.
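One place where that explicit synchronization is unavoidable is managed (unified) memory, where there is no cudaMemcpy to provide an implicit ordering point; a minimal sketch (mykernel, grid, block, and N are illustrative):
int *data;
cudaMallocManaged(&data, N * sizeof(int)); // one pointer, visible to host and device

mykernel<<<grid, block>>>(data);

// Without this, the host read below would race with the still-running kernel.
cudaDeviceSynchronize();

printf("%d\n", data[0]);                   // safe: the kernel is guaranteed done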

Cuda: Kernel launch queue

I'm not finding much info on the mechanics of a kernel launch operation. The API documentation says to see the CUDA Programming Guide, and I'm not finding much there either.
Given that kernel execution is asynchronous, and some machines support concurrent execution, I'm led to believe there is a queue for the kernels.
Host code:
1. malloc(hostArry, ......);
2. cudaMalloc(deviceArry, .....);
3. cudaMemcpy(deviceArry, hostArry, ... hostToDevice);
4. kernelA<<<1,300>>>(int, int);
5. kernelB<<<10,2>>>(float, int);
6. cudaMemcpy(hostArry, deviceArry, ... deviceToHost);
7. cudaFree(deviceArry);
Line 3 is synchronous. Lines 4 and 5 are asynchronous, and the machine supports concurrent execution, so at some point both of these kernels are running on the GPU. (There is the possibility that kernelB starts and finishes before kernelA finishes.) While this is happening, the host is executing line 6. Line 6 is synchronous with respect to the copy operation, but there is nothing preventing it from executing before kernelA or kernelB has finished.
1) Is there a kernel queue in the GPU? (Does the GPU block/stall the host?)
2) How does the host know that the kernel has finished, and it is "safe" to Xfer the results from the device to the host?
Yes, there are a variety of queues on the GPU, and the driver manages those.
Asynchronous calls return more or less immediately. Synchronous calls do not return until the operation is complete. Kernel calls are asynchronous. Most other CUDA runtime API calls are designated by the suffix Async if they are asynchronous. So to answer your question:
1) Is there a kernel queue in the GPU? (Does the GPU block/stall the host?)
There are various queues. The GPU blocks/stalls the host on a synchronous call, but the kernel launch is not a synchronous operation. It returns immediately, before the kernel has completed, and perhaps before the kernel has even started. When launching operations into a single stream, all CUDA operations in that stream are serialized. Therefore, even though kernel launches are asynchronous, you will not observe overlapped execution for two kernels launched to the same stream, because the CUDA subsystem guarantees that a given CUDA operation in a stream will not start until all previous CUDA operations in the same stream have finished. There are other specific rules for the null stream (the stream you are using if you don't explicitly call out streams in your code), but the preceding description is sufficient for understanding this question.
2) How does the host know that the kernel has finished, and it is "safe" to Xfer the results from the device to the host?
Since the operation that transfers results from the device to the host is a CUDA call (cudaMemcpy...), and it is issued in the same stream as the preceding operations, the device and CUDA driver manage the execution sequence of cuda calls so that the cudaMemcpy does not begin until all previous CUDA calls issued to the same stream have completed. Therefore a cudaMemcpy issued after a kernel call in the same stream is guaranteed not to start until the kernel call is complete, even if you use cudaMemcpyAsync.
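As a sketch of that last point (names are illustrative; the host buffer should be pinned with cudaMallocHost for the copy to be truly asynchronous), even an asynchronous copy cannot start before the kernel in the same stream finishes; only the host-side wait moves:
kernelA<<<1, 300, 0, stream>>>(d_arry, n);
// Queued behind kernelA in the same stream: cannot start before it completes.
cudaMemcpyAsync(h_arry, d_arry, bytes, cudaMemcpyDeviceToHost, stream);
// The host returns immediately from both calls, so block before using h_arry:
cudaStreamSynchronize(stream);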
You can use cudaDeviceSynchronize() after a kernel call to guarantee that all previous tasks requested of the device have completed.
If the results of kernelB are independent of the results of kernelA, you can place this call right before the memory copy operation. If not, you will need to block on the device before calling kernelB, resulting in two blocking operations.
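If the host only needs to learn that kernelA (say) has finished, an event recorded right after it is a lighter-weight alternative to cudaDeviceSynchronize(); a sketch (kernel arguments are placeholders from the question):
cudaEvent_t kernelA_done;
cudaEventCreate(&kernelA_done);

kernelA<<<1, 300>>>(d_arry, n);
cudaEventRecord(kernelA_done);      // marks the point in the stream after kernelA
kernelB<<<10, 2>>>(d_fArry, n);     // enqueued immediately; no host blocking yet

cudaEventSynchronize(kernelA_done); // host blocks only until kernelA has finished
Note that a cudaMemcpy issued to the same stream afterwards would still queue behind kernelB; to copy kernelA's results without waiting for kernelB, the two kernels would have to go into different streams.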

When to call cudaDeviceSynchronize?

When is calling the cudaDeviceSynchronize function really needed?
As far as I understand from the CUDA documentation, CUDA kernels are asynchronous, so it seems that we should call cudaDeviceSynchronize after each kernel launch. However, I have tried the same code (training neural networks) with and without any cudaDeviceSynchronize, except one before the time measurement. I found that I get the same result, but with a speed-up of 7-12x (depending on the matrix sizes).
So the question is whether there are any reasons to use cudaDeviceSynchronize apart from time measurement.
For example:
Is it needed before copying data from the GPU back to the host with cudaMemcpy?
If I do matrix multiplications like
C = A * B
D = C * F
should I put cudaDeviceSynchronize between both?
From my experiments it seems that I don't.
Why does cudaDeviceSynchronize slow the program so much?
Although CUDA kernel launches are asynchronous, all GPU-related tasks placed in one stream (which is the default behavior) are executed sequentially.
So, for example,
kernel1<<<X,Y>>>(...); // kernel starts executing, CPU continues to next statement
kernel2<<<X,Y>>>(...); // kernel is placed in queue and will start after kernel1 finishes, CPU continues to next statement
cudaMemcpy(...); // CPU blocks until memory is copied, and the memory copy starts only after kernel2 finishes
So in your example, there is no need for cudaDeviceSynchronize. However, it can be useful for debugging to detect which of your kernels has caused an error (if there is any); see the sketch below.
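A common debugging pattern along those lines (a sketch, using the placeholder launch from above):
kernel1<<<X, Y>>>(...);
cudaError_t err = cudaGetLastError();   // catches launch-configuration errors
if (err != cudaSuccess)
    printf("launch error: %s\n", cudaGetErrorString(err));

err = cudaDeviceSynchronize();          // catches errors raised during execution
if (err != cudaSuccess)
    printf("execution error: %s\n", cudaGetErrorString(err));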
cudaDeviceSynchronize may cause some slowdown, but 7-12x seems too much. There might be some problem with the time measurement, or maybe the kernels are really fast and the overhead of the explicit synchronization is huge relative to the actual computation time.
One situation where using cudaDeviceSynchronize() is appropriate is when you have several cudaStreams running and you would like to have them exchange some information. A real-life case of this is parallel tempering in quantum Monte Carlo simulations. In this case, we would want to ensure that every stream has finished running some set of instructions and gotten some results before they start passing messages to each other, or we would end up passing garbage information.
The reason using this command slows the program so much is that cudaDeviceSynchronize() forces the program to wait for all previously issued commands in all streams on the device to finish before continuing (from the CUDA C Programming Guide). As you said, kernel execution is normally asynchronous, so while the GPU device is executing your kernel the CPU can continue to work on some other commands and issue more instructions to the device, instead of waiting. However, when you use this synchronization command, the CPU is instead forced to idle until all the GPU work has completed before doing anything else.
This behaviour is useful when debugging, since you may have a segfault occurring at seemingly "random" times because of the asynchronous execution of device code (whether in one stream or many). cudaDeviceSynchronize() will force the program to ensure the stream(s)' kernels/memcpys are complete before continuing, which can make it easier to find out where the illegal accesses are occurring (since the failure will show up during the sync).
When you want your GPU to start processing some data, you typically do a kernel invocation.
When you do so, your device (the GPU) will start doing whatever you told it to do. However, unlike in a normal sequential program, your host (the CPU) will continue to execute the next lines of code in your program. cudaDeviceSynchronize makes the host (the CPU) wait until the device (the GPU) has finished executing ALL the threads you have started, so your program continues as if it were a normal sequential program.
In small simple programs you would typically use cudaDeviceSynchronize when you use the GPU for computations, to avoid timing mismatches between the CPU requesting a result and the GPU finishing the computation. Using cudaDeviceSynchronize makes it a lot easier to code your program, but there is one major drawback: your CPU is idle the whole time while the GPU does the computation. Therefore, in high-performance computing you often strive to have the CPU doing computations while it waits for the GPU to finish, as in the sketch below.
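That overlap might look like this sketch (mykernel and doCpuWork() are hypothetical names):
mykernel<<<grid, block>>>(d_data);  // launch returns immediately

doCpuWork();                        // the CPU computes while the GPU is busy

cudaDeviceSynchronize();            // only now does the CPU wait for the GPU
// GPU results can be used from here on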
You might also need to call cudaDeviceSynchronize() after launching kernels from kernels (Dynamic Parallelism).
From this post CUDA Dynamic Parallelism API and Principles:
If the parent kernel needs results computed by the child kernel to do its own work, it must ensure that the child grid has finished execution before continuing by explicitly synchronizing using cudaDeviceSynchronize(void). This function waits for completion of all grids previously launched by the thread block from which it has been called. Because of nesting, it also ensures that any descendants of grids launched by the thread block have completed.
...
Note that the view of global memory is not consistent when the kernel launch construct is executed. That means that in the following code example, it is not defined whether the child kernel reads and prints the value 1 or 2. To avoid race conditions, memory which can be read by the child should not be written by the parent after kernel launch but before explicit synchronization.
__device__ int v = 0;

__global__ void child_k(void) {
    printf("v = %d\n", v);
}

__global__ void parent_k(void) {
    v = 1;
    child_k<<<1, 1>>>();
    v = 2; // RACE CONDITION
    cudaDeviceSynchronize();
}