Can a CUDA event be fired from device-side code?

Is there any way to fire an event (for benchmarking purposes, similar to cudaEvents in the CPU code) from a device kernel in CUDA?
For example, suppose I would like to measure the time from kernel launch until the first thread begins its computation, and the time from when the last thread finishes its computation until control returns to the CPU.
Can I do that?

The device runtime API (used with dynamic parallelism) does have limited stream and event support, but event timing is not supported.
So, no, you can't do that.

An ugly workaround would be writing to some managed-memory location, and having a host-side thread poll it and fire the event when the value changes.
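For illustration, here is a minimal sketch of that workaround. It assumes a device/OS combination with concurrentManagedAccess (so the host can read managed memory while the kernel runs); the kernel and the choice of signalling thread are hypothetical simplifications:

#include <cuda_runtime.h>
#include <cstdio>
#include <chrono>

__global__ void work(volatile int *flag)
{
    if (threadIdx.x == 0 && blockIdx.x == 0)
        *flag = 1;                       // a thread signals the host on arrival
    // ... the actual computation would go here ...
}

int main()
{
    volatile int *flag;
    cudaMallocManaged((void **)&flag, sizeof(int));
    *flag = 0;

    auto t_launch = std::chrono::steady_clock::now();
    work<<<256, 256>>>(flag);

    while (*flag == 0) { }               // host-side poll: "fire the event" when the value changes
    auto t_first = std::chrono::steady_clock::now();

    cudaDeviceSynchronize();
    printf("launch-to-first-signal: %f us\n",
           std::chrono::duration<double, std::micro>(t_first - t_launch).count());
    return 0;
}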

Related

Wait until *any* device has finished in CUDA?

I have a CUDA kernel that I want to run across multiple GPUs. On each GPU, it's performing a search task, so I'd like to launch it on each GPU and then in the host code wait until any of the GPUs returns (indicating that it found what it was looking for).
I know about cudaDeviceSynchronize(), but that blocks until the current GPU is finished. Is there anything that will let me block until any one of N different GPUs finishes?
CUDA doesn't provide any built-in functions to accomplish that directly.
I believe you would need to do it via polling. If you want to build something that blocks the CPU thread, a spin loop around the polling operation would do it. (cudaDeviceSynchronize() is, by default, a spin operation under the hood.)
You could build a polling system using various ideas:
cudaEvent - launch an event after each kernel launch, then use cudaEventQuery() operations to poll (see the sketch below)
cudaHostAlloc - use host-pinned memory that each kernel can update with status - read the memory directly
cudaLaunchHostFunc - put a callback in place after each kernel launch. The callback host function would update ordinary host memory, which you could poll for status.
The callback method (at least) would allow you (perhaps via atomics) to collapse the polling to a single memory location, if that were important for some reason. You could probably implement something similar using the host-pinned memory method for systems that have CUDA system atomic support.
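As an illustration of the first (cudaEvent) idea, here is a minimal sketch, assuming a hypothetical searchKernel and omitting error checking and clean-up:

#include <cuda_runtime.h>
#include <vector>

__global__ void searchKernel(int *found) { /* ... search; write the result to found ... */ }

int main()
{
    int nDevices = 0;
    cudaGetDeviceCount(&nDevices);
    std::vector<cudaEvent_t> done(nDevices);
    std::vector<int *> found(nDevices);

    for (int d = 0; d < nDevices; ++d) {
        cudaSetDevice(d);
        cudaMalloc(&found[d], sizeof(int));
        cudaEventCreateWithFlags(&done[d], cudaEventDisableTiming);
        searchKernel<<<128, 128>>>(found[d]);
        cudaEventRecord(done[d]);        // completes when this GPU's kernel finishes
    }

    int winner = -1;                     // spin until any device's event has completed
    while (winner < 0)
        for (int d = 0; d < nDevices; ++d)
            if (cudaEventQuery(done[d]) == cudaSuccess) { winner = d; break; }
    // winner now identifies the first GPU to finish
    return 0;
}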

cudaEventElapsedTime with non-default streams

My question is about the use of the function cudaEventElapsedTime to measure the execution time in a multi-stream application.
According to CUDA documentation
If either event was last recorded in a non-NULL stream, the resulting time may be greater than expected (even if both used the same stream handle). This happens because the cudaEventRecord() operation takes place asynchronously and there is no guarantee that the measured latency is actually just between the two events. Any number of other different stream operations could execute in between the two measured events, thus altering the timing in a significant way.
I am genuinely struggling to understand the sentences quoted above. It seems it is more accurate to measure the time using the default stream, but I want to understand why. If I want to measure the execution time in a stream, I find it more logical to attach the start/stop events to that stream instead of the default stream. Any clarification, please? Thank you.
First of all let's remember basic CUDA stream semantics:
CUDA activity issued into the same stream will always execute in issue order.
There is no defined relationship between the order of execution of CUDA activities issued into separate streams.
The CUDA default stream (assuming we have not overridden the default legacy behavior) has an additional characteristic of implicit synchronization, which roughly means that a CUDA operation issued into the default stream will not begin executing until all prior issued CUDA activity to that device has completed.
Therefore, if we issue 2 CUDA events (say, start and stop) into the legacy default stream, we can be confident that any and all CUDA activity issued between those two issue points will be timed (regardless of which stream they were issued into, or which host thread they were issued from). I would suggest for casual usage this is intuitive, and less likely to be misinterpreted. Furthermore, it should yield consistent timing behavior, run-to-run (assuming host thread behavior is the same, i.e. somehow synchronized).
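As a point of reference, the usual pattern for event timing in the legacy default stream looks roughly like this (a minimal sketch, error checking omitted):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);                    // legacy default stream
// ... kernels and copies issued here, possibly into other streams ...
cudaEventRecord(stop);

cudaEventSynchronize(stop);                // wait for the stop event to complete
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);    // covers everything issued between the two records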
OTOH, let's say we have a multi-streamed application. Let's assume that we are issuing kernels into 2 or more non-default streams:
Stream1: cudaEventRecord(start)|Kernel1|Kernel2|cudaEventRecord(stop)
Stream2: |Kernel3|
It does not really matter too much whether these were issued from the same host thread or from separate host threads. For example, let's say our single host thread activity looked like this (condensed):
cudaEventRecord(start, Stream1);
Kernel1<<<..., Stream1>>>(...);
Kernel2<<<..., Stream1>>>(...);
Kernel3<<<..., Stream2>>>(...);
cudaEventRecord(stop, Stream1);
What timing should we expect? Will Kernel3 be included in the elapsed time between start and stop?
In fact the answer is unknown, and could vary from run-to-run, and probably would depend on what else is happening on the device before and during the above activity.
For the above issue order, and assuming we have no other activity on the device, we can assume that immediately after the cudaEventRecord(start) operation, Kernel1 will launch and begin executing. Let's suppose it "fills the device" so that no other kernels can execute concurrently. Let's also assume that the duration of Kernel1 is much longer than the launch latency of Kernel2 and Kernel3. Therefore, while Kernel1 is executing, both Kernel2 and Kernel3 are queued for execution. At the completion of Kernel1, the device scheduler has the option of beginning either Kernel2 or Kernel3. If it chooses Kernel2, then at the completion of Kernel2 it can mark the stop event as completed, which will establish the time duration between start and stop as, approximately, the duration of Kernel1 plus Kernel2.
Device Execution: event(start)|Kernel1|Kernel2|event(stop)|Kernel3|
                  |<------------- Duration -------------->|
However, if the scheduler chooses to begin Kernel3 before Kernel2 (an entirely legal and valid choice based on the stream semantics) then the stop event cannot be marked as complete until Kernel2 finishes, which means the measured duration will now include the duration of Kernel1 plus Kernel2 plus Kernel3. There is nothing in the CUDA programming model to sort this out, which means the measured timing could vary even run-to-run:
Device Execution: event(start)|Kernel1|Kernel3|Kernel2|event(stop)|
                  |<----------------- Duration ------------------>|
Furthermore, we could considerably alter the actual issue order, placing the issue/launch of Kernel3 before the first cudaEventRecord or after the last cudaEventRecord, and the above argument/variability still holds. This is where the meaning of the asynchronous nature of the cudaEventRecord call comes in. It does not block the CPU thread, but like a kernel launch it is asynchronous. Therefore all of the above activity can issue before any of it actually begins to execute on the device. Even if Kernel3 begins executing before the first cudaEventRecord, it will occupy the device for some time, delaying the beginning of execution of Kernel1, and therefore increasing the measured duration by some amount.
And even if Kernel3 is issued after the last cudaEventRecord, because all these issue operations are asynchronous, Kernel3 may still be queued up and ready to go when Kernel1 is complete, meaning the device scheduler can still make a choice about which to launch, making for possibly variable timing.
There are certainly other similar hazards that can be mapped out. This sort of possibility for variation in a multi-streamed scenario is what gives rise to the conservative advice to avoid trying to do cudaEvent based timing using events issued into the non-legacy-default stream.
Of course, if you for example use the visual profiler then there should be relatively little ambiguity about what was measured between two events (although it may still vary run-to-run). However, if you're going to use the visual profiler, you can read the duration directly off the timeline view, without needing an event elapsed time call.
Note that if you override the default stream legacy behavior, the default stream roughly becomes equivalent to an "ordinary" stream (especially for a single-threaded host application). In this case, we can't rely on the default stream semantics to sort this out. One possible option might be to precede any cudaEventRecord() call with a cudaDeviceSynchronize() call. I'm not suggesting this sorts out every possible scenario, but for single-device single host-thread applications, it should be equivalent to cudaEvent timing issued into default legacy stream.
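A sketch of that option, reusing start/stop/ms from the pattern above, with hypothetical kernels k1 and k2 issued into a non-default stream s1 (grid and block are placeholder launch parameters):

cudaDeviceSynchronize();                   // drain the device before recording start
cudaEventRecord(start, s1);
k1<<<grid, block, 0, s1>>>();
k2<<<grid, block, 0, s1>>>();
cudaDeviceSynchronize();                   // drain again before recording stop
cudaEventRecord(stop, s1);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);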
Complex scenario timing might be best done using a profiler. Many folks also dispense entirely with cudaEvent based timing and revert to high-resolution host timing methodologies. In any event, the timing of a complex concurrent asynchronous system is non-trivial. The conservative advice intends to avoid some of these issues for casual use.

How to uninitialise CUDA?

CUDA implicitly initialises when the first CUDA runtime function is called.
I'm timing the runtime of my code and repeating 100 times via a loop (for([100 times]) {[Time CUDA code and log]}), which also needs to take into account the initialisation time for CUDA at each iteration. Thus I need to uninitialise CUDA after every iteration - how to do this?
I've tried using cudaDeviceReset(), but it does not seem to have uninitialised CUDA.
Many thanks.
cudaDeviceReset is the canonical way to destroy a context in the runtime API (and calling cudaFree(0) is the canonical way to create a context). Those are the only levels of "re-initialization" available to a running process. There are other per-process events which happen when a process loads the CUDA driver and runtime libraries and connects to the kernel driver, but there is no way I am aware of to make those happen programmatically short of forking a new process.
But I really doubt you want or should be needing to account for this sort of setup time when calculating performance metrics anyway.
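That said, if you really do want to include context setup and teardown in each measured iteration, a sketch (using std::chrono for host timing) might look like this; timedWork() and log() are hypothetical stand-ins for the code being measured and your logging:

for (int i = 0; i < 100; ++i) {
    auto t0 = std::chrono::steady_clock::now();
    cudaFree(0);                           // forces context creation (implicit initialisation)
    timedWork();                           // the CUDA code being measured
    cudaDeviceSynchronize();
    auto t1 = std::chrono::steady_clock::now();
    log(std::chrono::duration<double, std::milli>(t1 - t0).count());
    cudaDeviceReset();                     // destroys the context before the next iteration
}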

How to calculate total time for CPU + GPU

I am doing some computation on the CPU and then I transfer the numbers to the GPU and do some work there. I want to calculate the total time taken to do the computation on the CPU + the GPU. How do I do so?
When your program starts, in main(), use any system timer to record the time. When your program ends at the bottom of main(), use the same system timer to record the time. Take the difference between time2 and time1. There you go!
There are different system timers you can use, some with higher resolution than others. Rather than discuss those here, I'd suggest you search for "system timer" on the SO site. If you just want any system timer, gettimeofday() works on Linux systems, but it has been superseded by newer, higher-precision functions. As it is, gettimeofday() only measures time in microseconds, which should be sufficient for your needs.
If you can't get a timer with good enough resolution, consider running your program in a loop many times, timing the execution of the loop, and dividing the measured time by the number of loop iterations.
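For example, a minimal sketch with gettimeofday() (doWorkOnCpuAndGpu() is a hypothetical stand-in for your CPU + GPU computation):

#include <sys/time.h>

struct timeval t1, t2;
gettimeofday(&t1, NULL);
doWorkOnCpuAndGpu();                       // CPU work plus GPU kernels/copies
cudaDeviceSynchronize();                   // make sure the GPU has finished before stopping the clock
gettimeofday(&t2, NULL);
double seconds = (t2.tv_sec - t1.tv_sec) + (t2.tv_usec - t1.tv_usec) / 1e6;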
EDIT:
System timers can be used to measure total application performance, including time used during the GPU calculation. Note that using system timers in this way applies only to real, or wall-clock, time, rather than process time. Measurements based on the wall-clock time must include time spent waiting for GPU operations to complete.
If you want to measure the time taken by a GPU kernel, you have a few options. First, you can use the Compute Visual Profiler to collect a variety of profiling information, and although I'm not sure that it reports time, it must be able to (that's a basic profiling function). Other profilers - PAPI comes to mind - offer support for CUDA kernels.
Another option is to use CUDA events to record times. Please refer to the CUDA 4.0 Programming Guide where it discusses using CUDA events to measure time.
Yet another option is to use system timers wrapped around GPU kernel invocations. Note that, because kernel launches return asynchronously, you will also need to follow the kernel invocation with a host-side GPU synchronization call such as cudaThreadSynchronize() for this method to be applicable. If you go with this option, I highly recommend calling the kernel in a loop, timing the loop plus one synchronization at the end (since kernel calls in the same stream execute in order, synchronization is not needed inside the loop), and dividing by the number of iterations.
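A sketch of that last option, reusing gettimeofday() from above (myKernel, its launch configuration, N, and d_data are hypothetical):

struct timeval t1, t2;
gettimeofday(&t1, NULL);
for (int i = 0; i < N; ++i)
    myKernel<<<grid, block>>>(d_data);     // same-stream launches execute in issue order
cudaDeviceSynchronize();                   // one synchronization after the loop is enough
gettimeofday(&t2, NULL);
double perLaunch = ((t2.tv_sec - t1.tv_sec) + (t2.tv_usec - t1.tv_usec) / 1e6) / N;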
The C timer moves on regardless of whether the GPU is working or not. If you don't believe me, try this little experiment: make a for loop with 1000 iterations over GPU_Function_Call, and put any C timer around that for loop. When you run the program (suppose the GPU function takes substantial time, like 20 ms) you will see it running for a few seconds, with the naked eye, before it returns. But when you print the C time you'll notice it shows only a few milliseconds. This is because the C timer didn't wait for the 1000 HtoD copies, 1000 DtoH copies, and 1000 kernel calls.
What I suggest is to use the CUDA event timer, or even better the NVIDIA Visual Profiler, to time the GPU, and to use a stopwatch (increase the iterations to reduce human error) to measure the complete time. Then just subtract the GPU time from the total to get the CPU time.

CPU and GPU timer in cuda visual profiler

So there are two timers in the CUDA Visual Profiler:
GPU Time: It is the execution time for the method on GPU.
CPU Time: It is sum of GPU time and CPU overhead to launch that Method. At driver generated data level, CPU Time is only CPU overhead to launch the Method for non-blocking Methods; for blocking methods it is sum of GPU time and CPU overhead. All kernel launches by default are non-blocking. But if any profiler counters are enabled kernel launches are blocking. Asynchronous memory copy requests in different streams are non-blocking.
If I have a real program, what's the actual execution time? I measure the time, and there is a GPU timer and a CPU timer as well; what's the difference?
You're almost there -- now that you're aware of some of the various options, the final step is to ask yourself exactly what time you want to measure. There's no right answer to this, because it depends on what you're trying to do with the measurement. CPU time and GPU time are exactly what you want when you are trying to optimize computation, but they may not include things like waiting that actually can be pretty important. You mention "the actual execution time" -- that's a start. Do you mean the complete execution time of the problem -- from when the user starts the program until the answer is spit out and the program ends? In a way, that's really the only time that actually matters.
For numbers like that, on Unix-type systems I like to just measure the entire runtime of the program: /bin/time myprog; presumably there's a Windows equivalent. That's nice because it's completely unambiguous. On the other hand, because it's a total, it can be far too broad to be helpful, and it's not much good if your code has a big GUI component, because then you're also measuring the time it takes for the user to click their way to results.
If you want the elapsed time of some set of computations, CUDA has very handy cudaEvent* functions which can be placed at various parts of the code -- see the CUDA Best Practices Guide, Section 2.1.2, Using CUDA GPU Timers -- you can put these before and after important bits of code and print the results.
The GPU timer is based on events. That means that when an event is recorded it is placed in a queue on the GPU to be serviced, so there is a small overhead there too. From what I have measured, though, the differences are of minor importance.