How to profile the CUDA application only by nvprof - cuda

I want to write a script to profile my cuda application only using the command tool nvprof. At present, I focus on two metrics: GPU utilization and GPU flops32 (FP32).
GPU utilization is the fraction of the time that the GPU is active. The active time of GPU can be easily obtained by nvprof --print-gpu-trace, while the elapsed time (without overhead) of the application is not clear for me. I use visual profiler nvvp to visualize the profiling results and calculate the GPU utilization. It seems that the elapsed time is the interval between the first and last API call, including the overhead time.
GPU flops32 is the number of FP32 instructions GPU executes per second while it is active. I follow Greg Smith's suggestion (How to calculate Gflops of a kernel) and find that it is very slow for nvprof to generate flop_count_sp_* metrics.
So there are two questions that I want to ask:
How to calculate the elapsed time (without overhead) of a CUDA application using nvprof?
Is there a faster way to obtain the gpu flops32?
Any suggestion would be appreciated.
================ Update =======================
For the first question above, the elapsed time without overhead which I meant is actually session time - overhead time showed in nvvp results:
nvvp results

You can use nVIDIA's NVTX library to programmatically mark named ranges or points on your timeline. The length of such a range, properly defined, would constitute your "elapsed time", and would show up very clearly in the nvvp visualization tool. Here is a "CUDA pro tip" blog post about doing this:
CUDA Pro Tip: Generate Custom Application Profile Timelines with NVTX
and if you want to do this in a more C++-friendly and RAII way, you can use my CUDA runtime API wrappers, which offer a scoped range marker and other utility functions. Of course, with me being the author, take my recommendation with a grain of salt and see what works for you.
About the "Elapsed time" for the session - that's the time between when you start and stop profiling activity. That can either be when the process comes up, or when you explicitly have profiling start. In my own API wrappers, there's a RAII class for that as well: cuda::profiling::scope or of course you can use the C-style API calls explicitly. (I should really write a sample program doing this, I haven't gotten around to that yet, unfortunately).

Related

Getting total execution time of all kernels on a CUDA stream

I know how to time the execution of one CUDA kernel using CUDA events, which is great for simple cases. But in the real world, an algorithm is often made up of a series of kernels (CUB::DeviceRadixSort algorithms, for example, launch many kernels to get the job done). If you're running your algorithm on a system with a lot of other streams and kernels also in flight, it's not uncommon for the gaps between individual kernel launches to be highly variable based on what other work gets scheduled in-between launches on your stream. If I'm trying to make my algorithm work faster, I don't care so much about how long it spends sitting waiting for resources. I care about the time it spends actually executing.
So the question is, is there some way to do something like the event API and insert a marker in the stream before the first kernel launches, and read it back after your last kernel launches, and have it tell you the actual amount of time spent executing on the stream, rather than the total end-to-end wall-clock time? Maybe something in CUPTI can do this?
You can use Nsight Systems or Nsight Compute.
(https://developer.nvidia.com/tools-overview)
In Nsight Systems, you can profile timelines of each stream. Also, you can use Nsight Compute to profile details of each CUDA kernel. I guess Nsight Compute is better because you can inspect various metrics about GPU performances and get hints about the kernel optimization.

Timing of cuda kernel from inside of kernel?

background:
I have a kernel that I measure with windows QPC (264 nanosecond tick rate) at 4ms. But I am a friendly dispute with a colleague running my kernel who claims is takes 15ms+ (we are both doing this after warm-up with a Tesla K40). I suspect his issue is with a custom RHEL, custom cuda drivers, and his "real time " thread groups , but i am not a linux expert. I know windows clocks are less than perfect, but this is too big a discrepancy. (besides it all our timing of other kernels I wrote agree with his timing, it is only the first in the chain of kernels that the time disagrees). Smells to me of something outside the kernel.
question:
Anyway is there a way with CudeDeviceEvents (elapsed time) to add to the CUDA kernel to measure the ENTIRE kernel time from when the first block starts to the end of of the last block? I think this would get us started in figuring out where the problem is. From my reading, it looks like cuda device events are done on the host, and I am looking for something internal to the gpu.
The only way to time execution from entirely within a kernel is to use the clock() and clock64() functions that are covered in the programming guide.
Since these functions sample a per-multiprocessor counter, and AFAIK there is no specified relationship between these counters from one SM to the next, there is no way to determine using these functions alone, which thread/warp/block is "first" to execute and which is "last" to execute, assuming your GPU has more than 1 SM. (Even if there were a specified relationship, such as "they are all guaranteed to be the same value on any given cycle", you would still need additional scaffolding as mentioned below.)
While you could certainly create some additional scaffolding in your code to try to come up with an overall execution time (perhaps adding in atomics to figure out which thread/warp/block is first and last), there may still be functional gaps in the method. Given the difficulty, it seems that the best method, based on what you've described, is simply to use the profilers as discussed by #njuffa in the comments. Any of the profilers can provide you with the execution time of a kernel, on any supported platform, with a trivial set of commands.

Calculating achieved bandwidth and flops/Gflops, and evaluate CUDA kernel performance

Most of the papers show the flops/Gflops and achieved bandwidth for their CUDA kernels. I have also read answers on stackoverflow for the following questions:
How to evaluate CUDA performance?
How Do You Profile & Optimize CUDA Kernels?
How to calculate Gflops of a kernel
Counting FLOPS/GFLOPS in program - CUDA
How to calculate the achieved bandwidth of a CUDA kernel
Most of the things seem ok, but still does not make me feel comfortable in calculating these things. Can anyone write a simple CUDA kernel? Then give the output of deviceQuery. Then compute step by step the flops/Gflops and achieved bandwidth for this kernel. Then show the Visual Profiler results for this kernel. I.e. show the results in detail with all the information obtained step by step for this simple CUDA kernel. That would be really helpful for most of us. Thanks!
Nsight Visual Studio Edition 2.1 and Above
The information you requested is available if you collect Achieved FLOPS experiment and Memory Statistics - Buffers experiment.
Visual Profiler 4.2 and Above
Achieved Bandwidth: When mouse over a kernel in the Timeline this information the information is available in the Properties Pane under Memory\DRAM Utilization.
The profiler cannot collect FLOPS count yet. This can be done by running cuobjdump -sass to view the assembly code. Step through the kernel and count single and double precision floating points instructions multiplying FMA and DFMA operations by 2. Each instruction should also be multiplied by the predicated true threads. You also have to account for control flow. This is not fun and requires someone with a strong knowlege of the instruction set. This may be better accomplished by single stepping the assembly in the debugger. The duration of the kernel is available in the Visual Profiler Properties Pane and Details Pane as Duration.
You could follow the calculations of Mark Harris in Optimizing Parallel Reductions in CUDA. There he uses the input data as base and divides it through the time of the kernel execution. In the examples he used 2^22 ints so he has 0,016777216 GB of input data. The first kernel took 8,054 ms which is an achieved bandwidth of 2,083 GB/s.
After several optimizations he approached 62,671 GB/s and compares it to the peak performance of the used GPU which is at 86,4 GB/s.
Although he used ints you can easily adapt that to flops/Gflops.

How to calculate total time for CPU + GPU

I am doing some computation on the CPU and then I transfer the numbers to the GPU and do some work there. I want to calculate the total time taken to do the computation on the CPU + the GPU. how do i do so?
When your program starts, in main(), use any system timer to record the time. When your program ends at the bottom of main(), use the same system timer to record the time. Take the difference between time2 and time1. There you go!
There are different system timers you can use, some with higher resolution than others. Rather than discuss those here, I'd suggest you search for "system timer" on the SO site. If you just want any system timer, gettimeofday() works on Linux systems, but it has been superseded by newer, higher-precision functions. As it is, gettimeofday() only measures time in microseconds, which should be sufficient for your needs.
If you can't get a timer with good enough resolution, consider running your program in a loop many times, timing the execution of the loop, and dividing the measured time by the number of loop iterations.
EDIT:
System timers can be used to measure total application performance, including time used during the GPU calculation. Note that using system timers in this way applies only to real, or wall-clock, time, rather than process time. Measurements based on the wall-clock time must include time spent waiting for GPU operations to complete.
If you want to measure the time taken by a GPU kernel, you have a few options. First, you can use the Compute Visual Profiler to collect a variety of profiling information, and although I'm not sure that it reports time, it must be able to (that's a basic profiling function). Other profilers - PAPI comes to mind - offer support for CUDA kernels.
Another option is to use CUDA events to record times. Please refer to the CUDA 4.0 Programming Guide where it discusses using CUDA events to measure time.
Yet another option is to use system timers wrapped around GPU kernel invocations. Note that, given the asynchronous nature of kernel invocation returns, you will also need to follow the kernel invocation with a host-side GPU synchronization call such as cudaThreadSynchronize() for this method to be applicable. If you go with this option, I highly recommend calling the kernel in a loop, timing the loop + one synchronization at the end (since synchronization occurs between kernel calls not executing in different streams, cudaThreadSynchronize() is not needed inside the loop), and dividing by the number of iterations.
The C timer moves on regardless of GPU is working or not. If you don't believe me then do this little experiment: Make a for loop with 1000 iterations over GPU_Function_Call. Put any C timer around that for loop. Now when you run the program (suppose GPU function takes substantial time like 20ms) you will see it running for few seconds with the naked eye before it returns. But when you print the C time you'll notice it'll show you like few miliseconds. This is because the C timer didn't wait for 1000 MemcpyHtoD and 1000 MemcpyfromDtoH and 1000 kernel calls.
What I suggest is to use CUDA event timer or even better NVIDIA Visual Profiler to time GPU and use stop watch (increase the iterations to reduce human error) to measure the complete time. Then just subtract the GPU time from total to get the CPU time.

CPU and GPU timer in cuda visual profiler

So there are 2 timers in cuda visual profiler,
GPU Time: It is the execution time for the method on GPU.
CPU Time:It is sum of GPU time and CPU overhead to launch that Method. At driver generated data level, CPU Time is only CPU overhead to launch the Method for non-blocking Methods; for blocking methods it is sum of GPU time and CPU overhead. All kernel launches by default are non-blocking. But if any profiler counters are enabled kernel launches are blocking. Asynchronous memory copy requests in different streams are non-blocking.
If I have a real program, what's the actual exectuion time? I measure the time, there is a GPU timer and a CPU timer as well, what's the difference?
You're almost there -- now that you're aware of some of the various options, the final step is to ask yourself exactly what time you want to measure. There's no right answer to this, because it depends on what you're trying to do with the measurement. CPU time and GPU time are exactly what you want when you are trying to optimize computation, but they may not include things like waiting that actually can be pretty important. You mention “the actual exectuion time” — that's a start. Do you mean the complete execution time of the problem — from when the user starts the program until the answer is spit out and the program ends? In a way, that's really the only time that actually matters.
For numbers like that, in Unix-type systems I like to just measure the entire runtime of the program; /bin/time myprog, presumably there's a Windows equivalent. That's nice because it's completely unabigious. On the other hand, because it's a total, it's far too broad to be helpful, and it's not much good if your code has a big GUI component, because then you're also measuring the time it takes for the user to click their way to results.
If you want elapsed time of some set of computations, cuda has very handy functions cudaEvent* which can be placed at various parts of the code — see the CUDA Best Practices Guide, s 2.1.2, Using CUDA GPU Timers — these you can put before and after important bits of code and print the results.
gpu timer is based on events.
that means that when an event is created it will be set in a queue at gpu for serving. so there is a small overhead there too.
from what i have measured though the differences are of minor importance