How to calculate total time for CPU + GPU - CUDA

I am doing some computation on the CPU and then I transfer the numbers to the GPU and do some work there. I want to calculate the total time taken to do the computation on the CPU + the GPU. How do I do so?

When your program starts, in main(), use any system timer to record the time. When your program ends, at the bottom of main(), record the time again with the same system timer. Take the difference between the two times (time2 - time1). There you go!
There are different system timers you can use, some with higher resolution than others. Rather than discuss those here, I'd suggest you search for "system timer" on the SO site. If you just want any system timer, gettimeofday() works on Linux systems, but it has been superseded by newer, higher-precision functions such as clock_gettime(). As it is, gettimeofday() only measures time with microsecond resolution, which should be sufficient for your needs.
If you can't get a timer with good enough resolution, consider running your program in a loop many times, timing the execution of the loop, and dividing the measured time by the number of loop iterations.
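As an illustration, here is a minimal sketch of that approach, assuming a Linux host and using clock_gettime() (one of the newer, higher-precision functions mentioned above); cpu_work() and gpu_work() are hypothetical placeholders for your own computation and GPU calls:

#include <stdio.h>
#include <time.h>
#include <cuda_runtime.h>

static void cpu_work(void) { /* placeholder: your CPU computation */ }
static void gpu_work(void) { /* placeholder: your copies and kernel launches */ }

int main(void)
{
    struct timespec t1, t2;
    clock_gettime(CLOCK_MONOTONIC, &t1);            /* time1 */

    cpu_work();
    gpu_work();
    cudaDeviceSynchronize();                        /* wait for the GPU to finish */

    clock_gettime(CLOCK_MONOTONIC, &t2);            /* time2 */
    double seconds = (t2.tv_sec - t1.tv_sec) + (t2.tv_nsec - t1.tv_nsec) / 1e9;
    printf("Total CPU + GPU time: %.6f s\n", seconds);
    return 0;
}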
EDIT:
System timers can be used to measure total application performance, including time used during the GPU calculation. Note that system timers used this way measure real, or wall-clock, time rather than process time, and wall-clock measurements necessarily include time spent waiting for GPU operations to complete.
If you want to measure the time taken by a GPU kernel, you have a few options. First, you can use the Compute Visual Profiler to collect a variety of profiling information, and although I'm not sure that it reports time, it must be able to (that's a basic profiling function). Other profilers - PAPI comes to mind - offer support for CUDA kernels.
Another option is to use CUDA events to record times. Please refer to the CUDA 4.0 Programming Guide where it discusses using CUDA events to measure time.
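A minimal sketch of CUDA event timing, in the spirit of that Programming Guide section (the kernel and launch configuration here are placeholders, not code from the question):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void myKernel(void) { /* placeholder kernel */ }

int main(void)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);          /* record in the default stream */
    myKernel<<<128, 256>>>();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);         /* wait until the stop event has completed */

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}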
Yet another option is to use system timers wrapped around GPU kernel invocations. Note that, because kernel launches return asynchronously, you will also need to follow the kernel invocation with a host-side GPU synchronization call such as cudaThreadSynchronize() (cudaDeviceSynchronize() in current CUDA versions) for this method to be applicable. If you go with this option, I highly recommend calling the kernel in a loop, timing the loop plus one synchronization at the end (since kernel launches issued into the same stream execute in order, a synchronization call is not needed inside the loop), and dividing by the number of iterations.
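A minimal sketch of that loop-plus-synchronize approach (again with a placeholder kernel and launch configuration, and using cudaDeviceSynchronize() as the modern equivalent of cudaThreadSynchronize()):

#include <stdio.h>
#include <time.h>
#include <cuda_runtime.h>

__global__ void myKernel(void) { /* placeholder kernel */ }

int main(void)
{
    const int N_ITER = 1000;
    struct timespec t1, t2;

    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (int i = 0; i < N_ITER; ++i)
        myKernel<<<128, 256>>>();       /* same stream: launches execute in order */
    cudaDeviceSynchronize();            /* one synchronization after the loop suffices */
    clock_gettime(CLOCK_MONOTONIC, &t2);

    double seconds = (t2.tv_sec - t1.tv_sec) + (t2.tv_nsec - t1.tv_nsec) / 1e9;
    printf("Average kernel time: %.6f ms\n", 1e3 * seconds / N_ITER);
    return 0;
}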

The C timer moves on regardless of whether the GPU is working or not. If you don't believe me, try this little experiment: make a for loop with 1000 iterations over GPU_Function_Call. Put any C timer around that for loop. When you run the program (suppose the GPU function takes substantial time, say 20 ms), you will see with the naked eye that it runs for a few seconds before it returns. But when you print the C time, you'll notice it shows only a few milliseconds. This is because the C timer didn't wait for the 1000 MemcpyHtoD, 1000 MemcpyDtoH, and 1000 kernel calls.
What I suggest is to use the CUDA event timer, or even better the NVIDIA Visual Profiler, to time the GPU, and use a stopwatch (increase the iterations to reduce human error) to measure the complete time. Then just subtract the GPU time from the total to get the CPU time.

Related

Getting total execution time of all kernels on a CUDA stream

I know how to time the execution of one CUDA kernel using CUDA events, which is great for simple cases. But in the real world, an algorithm is often made up of a series of kernels (CUB::DeviceRadixSort algorithms, for example, launch many kernels to get the job done). If you're running your algorithm on a system with a lot of other streams and kernels also in flight, it's not uncommon for the gaps between individual kernel launches to be highly variable based on what other work gets scheduled in-between launches on your stream. If I'm trying to make my algorithm work faster, I don't care so much about how long it spends sitting waiting for resources. I care about the time it spends actually executing.
So the question is, is there some way to do something like the event API and insert a marker in the stream before the first kernel launches, and read it back after your last kernel launches, and have it tell you the actual amount of time spent executing on the stream, rather than the total end-to-end wall-clock time? Maybe something in CUPTI can do this?
You can use Nsight Systems or Nsight Compute.
(https://developer.nvidia.com/tools-overview)
In Nsight Systems, you can profile the timelines of each stream. You can also use Nsight Compute to profile the details of each CUDA kernel. I guess Nsight Compute is better because you can inspect various metrics about GPU performance and get hints about kernel optimization.

How to profile the CUDA application only by nvprof

I want to write a script to profile my cuda application only using the command tool nvprof. At present, I focus on two metrics: GPU utilization and GPU flops32 (FP32).
GPU utilization is the fraction of the time that the GPU is active. The active time of the GPU can be easily obtained from nvprof --print-gpu-trace, while the elapsed time (without overhead) of the application is not clear to me. I use the visual profiler nvvp to visualize the profiling results and calculate the GPU utilization. It seems that the elapsed time is the interval between the first and last API call, including the overhead time.
GPU flops32 is the number of FP32 instructions the GPU executes per second while it is active. I follow Greg Smith's suggestion (How to calculate Gflops of a kernel) and find that it is very slow for nvprof to generate the flop_count_sp_* metrics.
So there are two questions that I want to ask:
How to calculate the elapsed time (without overhead) of a CUDA application using nvprof?
Is there a faster way to obtain the gpu flops32?
Any suggestion would be appreciated.
================ Update =======================
For the first question above, the elapsed time without overhead that I meant is actually the session time minus the overhead time shown in the nvvp results:
[screenshot: nvvp results]
You can use NVIDIA's NVTX library to programmatically mark named ranges or points on your timeline. The length of such a range, properly defined, would constitute your "elapsed time", and would show up very clearly in the nvvp visualization tool. Here is a "CUDA pro tip" blog post about doing this:
CUDA Pro Tip: Generate Custom Application Profile Timelines with NVTX
and if you want to do this in a more C++-friendly and RAII way, you can use my CUDA runtime API wrappers, which offer a scoped range marker and other utility functions. Of course, with me being the author, take my recommendation with a grain of salt and see what works for you.
About the "Elapsed time" for the session - that's the time between when you start and stop profiling activity. That can either be when the process comes up, or when you explicitly have profiling start. In my own API wrappers, there's a RAII class for that as well: cuda::profiling::scope or of course you can use the C-style API calls explicitly. (I should really write a sample program doing this, I haven't gotten around to that yet, unfortunately).

cudaElapsedTime with non-default streams

My question is about the use of the function cudaEventElapsedTime to measure the execution time in a multi-stream application.
According to CUDA documentation
If either event was last recorded in a non-NULL stream, the resulting time may be greater than expected (even if both used the same stream handle). This happens because the cudaEventRecord() operation takes place asynchronously and there is no guarantee that the measured latency is actually just between the two events. Any number of other different stream operations could execute in between the two measured events, thus altering the timing in a significant way.
I am genuinely struggling to understand the sentences in bold above. It seems it is more accurate to measure the time using the default stream, but I want to understand why. If I want to measure the execution time in a stream, I find it more logical to attach the start/stop events to that stream instead of the default stream. Any clarification, please? Thank you.
First of all let's remember basic CUDA stream semantics:
CUDA activity issued into the same stream will always execute in issue order.
There is no defined relationship between the order of execution of CUDA activities issued into separate streams.
The CUDA default stream (assuming we have not overridden the default legacy behavior) has an additional characteristic of implicit synchronization, which roughly means that a CUDA operation issued into the default stream will not begin executing until all prior issued CUDA activity to that device has completed.
Therefore, if we issue 2 CUDA events (say, start and stop) into the legacy default stream, we can be confident that any and all CUDA activity issued between those two issue points will be timed (regardless of which stream they were issued into, or which host thread they were issued from). I would suggest for casual usage this is intuitive, and less likely to be misinterpreted. Furthermore, it should yield consistent timing behavior, run-to-run (assuming host thread behavior is the same, i.e. somehow synchronized).
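As a minimal sketch of that point (placeholder kernels and launch configurations, assuming the legacy default stream), events recorded into the default stream bracket all work issued between them, even work launched into other streams:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernelA() { /* placeholder */ }
__global__ void kernelB() { /* placeholder */ }

int main()
{
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);              // legacy default stream
    kernelA<<<128, 256, 0, s1>>>();
    kernelB<<<128, 256, 0, s2>>>();
    cudaEventRecord(stop);               // legacy default stream
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("All activity issued between the events: %.3f ms\n", ms);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}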
OTOH, let's say we have a multi-streamed application. Let's assume that we are issuing kernels into 2 or more non-default streams:
Stream1: cudaEventRecord(start)|Kernel1|Kernel2|cudaEventRecord(stop)
Stream2: |Kernel3|
It does not really matter too much whether these were issued from the same host thread or from separate host threads. For example, let's say our single host thread activity looked like this (condensed):
cudaEventRecord(start, Stream1);
Kernel1<<<..., Stream1>>>(...);
Kernel2<<<..., Stream1>>>(...);
Kernel3<<<..., Stream2>>>(...);
cudaEventRecord(stop, Stream1);
What timing should we expect? Will Kernel3 be included in the elapsed time between start and stop?
In fact the answer is unknown, and could vary from run-to-run, and probably would depend on what else is happening on the device before and during the above activity.
For the above issue order, and assuming we have no other activity on the device, we can assume that immediately after the cudaEventRecord(start) operation, Kernel1 will launch and begin executing. Let's suppose it "fills the device" so that no other kernels can execute concurrently. Let's also assume that the duration of Kernel1 is much longer than the launch latency of Kernel2 and Kernel3. Therefore, while Kernel1 is executing, both Kernel2 and Kernel3 are queued for execution. At the completion of Kernel1, the device scheduler has the option of beginning either Kernel2 or Kernel3. If it chooses Kernel2, then at the completion of Kernel2 it can mark the stop event as completed, which will establish the time duration between start and stop as approximately the duration of Kernel1 plus Kernel2:
Device Execution: event(start)|Kernel1|Kernel2|event(stop)|Kernel3|
                              |<--------Duration--------->|
However, if the scheduler chooses to begin Kernel3 before Kernel2 (an entirely legal and valid choice based on the stream semantics), then the stop event cannot be marked as complete until Kernel2 finishes, which means the measured duration will now include the duration of Kernel1 plus Kernel2 plus Kernel3. There is nothing in the CUDA programming model to sort this out, which means the measured timing could vary even run-to-run:
Device Execution: event(start)|Kernel1|Kernel3|Kernel2|event(stop)|
                              |<------------Duration------------->|
Furthermore, we could considerably alter the actual issue order, placing the issue/launch of Kernel3 before the first cudaEventRecord or after the last cudaEventRecord, and the above argument/variability still holds. This is where the meaning of the asynchronous nature of the cudaEventRecord call comes in. It does not block the CPU thread, but like a kernel launch it is asynchronous. Therefore all of the above activity can issue before any of it actually begins to execute on the device. Even if Kernel3 begins executing before the first cudaEventRecord, it will occupy the device for some time, delaying the beginning of execution of Kernel1, and therefore increasing the measured duration by some amount.
And even if Kernel3 is issued after the last cudaEventRecord, because all these issue operations are asynchronous, Kernel3 may still be queued up and ready to go when Kernel1 completes, meaning the device scheduler can still make a choice about which to launch, making for possibly variable timing.
There are certainly other similar hazards that can be mapped out. This sort of possibility for variation in a multi-streamed scenario is what gives rise to the conservative advice to avoid trying to do cudaEvent based timing using events issued into the non-legacy-default stream.
Of course, if you for example use the visual profiler then there should be relatively little ambiguity about what was measured between two events (although it may still vary run-to-run). However, if you're going to use the visual profiler, you can read the duration directly off the timeline view, without needing an event elapsed time call.
Note that if you override the default stream legacy behavior, the default stream roughly becomes equivalent to an "ordinary" stream (especially for a single-threaded host application). In this case, we can't rely on the default stream semantics to sort this out. One possible option might be to precede any cudaEventRecord() call with a cudaDeviceSynchronize() call. I'm not suggesting this sorts out every possible scenario, but for single-device single host-thread applications, it should be equivalent to cudaEvent timing issued into default legacy stream.
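A minimal sketch of that synchronize-before-record pattern on a non-default stream (placeholder kernel and launch configuration; not a cure for every scenario, per the caveats above):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { /* placeholder */ }

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaDeviceSynchronize();             // drain previously issued work
    cudaEventRecord(start, stream);
    myKernel<<<128, 256, 0, stream>>>();
    cudaDeviceSynchronize();             // ensure only this work lands between the events
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Measured: %.3f ms\n", ms);

    cudaStreamDestroy(stream);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}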
Complex scenario timing might be best done using a profiler. Many folks also dispense entirely with cudaEvent based timing and revert to high-resolution host timing methodologies. In any event, the timing of a complex concurrent asynchronous system is non-trivial. The conservative advice intends to avoid some of these issues for casual use.

Timing of cuda kernel from inside of kernel?

background:
I have a kernel that I measure with Windows QPC (264 nanosecond tick rate) at 4 ms. But I am in a friendly dispute with a colleague running my kernel who claims it takes 15 ms+ (we are both doing this after warm-up with a Tesla K40). I suspect his issue is with a custom RHEL, custom CUDA drivers, and his "real time" thread groups, but I am not a Linux expert. I know Windows clocks are less than perfect, but this is too big a discrepancy. (Besides, all our timing of the other kernels I wrote agrees with his timing; it is only the first kernel in the chain where the times disagree.) Smells to me of something outside the kernel.
question:
Anyway, is there a way with CUDA device events (elapsed time) to add something to the CUDA kernel to measure the ENTIRE kernel time, from when the first block starts to the end of the last block? I think this would get us started in figuring out where the problem is. From my reading, it looks like CUDA device events are done on the host, and I am looking for something internal to the GPU.
The only way to time execution from entirely within a kernel is to use the clock() and clock64() functions that are covered in the programming guide.
Since these functions sample a per-multiprocessor counter, and AFAIK there is no specified relationship between these counters from one SM to the next, there is no way to determine using these functions alone, which thread/warp/block is "first" to execute and which is "last" to execute, assuming your GPU has more than 1 SM. (Even if there were a specified relationship, such as "they are all guaranteed to be the same value on any given cycle", you would still need additional scaffolding as mentioned below.)
While you could certainly create some additional scaffolding in your code to try to come up with an overall execution time (perhaps adding atomics to figure out which thread/warp/block is first and last), there may still be functional gaps in the method. Given the difficulty, it seems that the best method, based on what you've described, is simply to use the profilers as discussed by @njuffa in the comments. Any of the profilers can provide you with the execution time of a kernel, on any supported platform, with a trivial set of commands.
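That said, a minimal sketch of clock64()-based in-kernel timing, with the caveat repeated from above that the counter is per-multiprocessor, so this yields per-block cycle counts rather than a device-wide first-block-to-last-block time (kernel body and launch configuration are placeholders):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void timedKernel(long long *cycles)
{
    long long t0 = clock64();

    /* ... placeholder for the kernel's real work ... */

    long long t1 = clock64();
    if (threadIdx.x == 0)
        cycles[blockIdx.x] = t1 - t0;    // elapsed cycles for this block, on its own SM
}

int main()
{
    const int blocks = 128;
    long long *d_cycles, h_cycles[blocks];
    cudaMalloc(&d_cycles, blocks * sizeof(long long));

    timedKernel<<<blocks, 256>>>(d_cycles);
    cudaMemcpy(h_cycles, d_cycles, blocks * sizeof(long long), cudaMemcpyDeviceToHost);

    printf("block 0 took %lld cycles on its SM\n", h_cycles[0]);
    cudaFree(d_cycles);
    return 0;
}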

CPU and GPU timer in cuda visual profiler

So there are 2 timers in the CUDA Visual Profiler:
GPU Time: It is the execution time for the method on the GPU.
CPU Time: It is the sum of GPU time and the CPU overhead to launch that method. At the driver-generated data level, CPU Time is only the CPU overhead to launch the method for non-blocking methods; for blocking methods it is the sum of GPU time and CPU overhead. All kernel launches by default are non-blocking. But if any profiler counters are enabled, kernel launches are blocking. Asynchronous memory copy requests in different streams are non-blocking.
If I have a real program, what's the actual execution time? When I measure the time, there is a GPU timer and a CPU timer as well; what's the difference?
You're almost there -- now that you're aware of some of the various options, the final step is to ask yourself exactly what time you want to measure. There's no right answer to this, because it depends on what you're trying to do with the measurement. CPU time and GPU time are exactly what you want when you are trying to optimize computation, but they may not include things like waiting that actually can be pretty important. You mention "the actual execution time" — that's a start. Do you mean the complete execution time of the problem — from when the user starts the program until the answer is spit out and the program ends? In a way, that's really the only time that actually matters.
For numbers like that, in Unix-type systems I like to just measure the entire runtime of the program; /bin/time myprog, presumably there's a Windows equivalent. That's nice because it's completely unambiguous. On the other hand, because it's a total, it's far too broad to be helpful, and it's not much good if your code has a big GUI component, because then you're also measuring the time it takes for the user to click their way to results.
If you want the elapsed time of some set of computations, CUDA has very handy cudaEvent* functions, which can be placed at various parts of the code — see the CUDA Best Practices Guide, section 2.1.2, Using CUDA GPU Timers — these you can put before and after important bits of code and print the results.
The GPU timer is based on events. That means that when an event is created, it will be set in a queue at the GPU for serving, so there is a small overhead there too. From what I have measured, though, the differences are of minor importance.